I have a Pandas DataFrame of users' subscription dates in the following format:
UserId, StartingDate, EndingDate
I am trying to calculate the Churn Rate metric for every day.
What Churn Rate is:
The churn rate, also known as the rate of attrition, is the percentage of subscribers to a service who discontinue their subscriptions to that service within a given time period.
So, for every day, I go back 1 month, get the list of unique users that had an active subscription then, and check how many of them no longer have one.
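In other words, writing active(d) for the set of users with an active subscription on day d, the number I want for each day is:

churn(d) = |active(d - 1 month) - active(d)| / |active(d - 1 month)|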
I wrote the code, but it takes ages to finish, so I am looking for any performance issues it might have.
import pandas as pd
from datetime import datetime
from datetime import timedelta

df = pd.read_csv("subscriptions.csv")

# make sure both columns are of datetime type
df['StartingDate'] = pd.to_datetime(df['StartingDate'])
df['EndingDate'] = pd.to_datetime(df['EndingDate'])

# get the first date of the dataframe to start the loop with, and set the stop date to today
start = pd.to_datetime(df.StartingDate.min())
minDate = start
stop = datetime.now()

def getUsersFromADate(df, date):
    return df.loc[(df['StartingDate'] <= date) & (df['EndingDate'] >= date)].UserId.unique()

churn = []
while start <= stop:
    # the first 30 days don't have a churn rate, so just append a 0 value
    if start < minDate + pd.DateOffset(months=1):
        churn.append(0)
    else:
        usersBefore = getUsersFromADate(df, start - pd.DateOffset(months=1))
        usersNow = getUsersFromADate(df, start)
        lost = 0
        for u in usersBefore:
            if u not in usersNow:
                lost += 1
        churn.append(lost / len(usersBefore))
    start = start + timedelta(days=1)  # increase the day one by one
Example of my data:
   UserId StartingDate EndingDate
0       1   2013-05-09 2015-04-24
1       1   2015-04-29 2017-04-02
2       1   2017-04-05 2017-12-06
3       2   2014-02-13 2018-02-07
4       3   2013-04-25 2018-04-19
- Making a query for every day can get very expensive if the range gets large. How large can this range be? And how many records are expected in the csv? – juvian
- The first date is in mid-2013. The whole file is about 7 MB. I'm not at my laptop now to check the exact number of lines, but it should be less than 500k. – Tasos
2 Answers
You could do the entire thing in pandas and numpy, since their Cython implementations will be much faster than iterating over objects in Python.
First, simulate some data:
import pandas as pd, numpy as np
from datetime import datetime
num_samples = 50000
user_ids = np.arange(num_samples)
dates = pd.date_range("2012-01-01","2015-01-01")
start_dates = dates[np.random.randint(0, len(dates), num_samples)]
data = pd.DataFrame(data={"user_id": user_ids, "start_date":start_dates})
data["end_date"] = data.start_date.apply(lambda x: x + pd.DateOffset(days=np.random.randint(0,300)))
which results in data:
  start_date  user_id   end_date
  2013-12-15        0 2014-09-24
  2013-12-13        1 2014-01-17
  2014-08-29        2 2015-03-25
  2014-04-13        3 2015-01-04
  2014-01-21        4 2014-06-22
We need an output dataframe so that we don't have to painfully iterate over each date one by one:
output = pd.DataFrame({"date":pd.date_range(data.start_date.min(), datetime.now())})
and now define your functions to calculate the churn rate and apply it over the entire data frame:
def get_users_from_date(df, date):
    return df[(date >= df.start_date) & (date <= df.end_date)].user_id

def calc_churn(df, date):
    users_before = get_users_from_date(df, date - pd.DateOffset(months=1))
    users_now = get_users_from_date(df, date)
    if len(users_before):
        return len(np.setdiff1d(users_before, users_now)) / len(users_before)
    else:
        return 0

output["churn"] = output.date.apply(lambda x: calc_churn(data, x))
This finishes for me in just:
5.42 s ± 65.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Since we are using np.setdiff1d to find the difference, we don't need to deduplicate each subset of user ids beforehand. Finding the unique subsets in get_users_from_date and then passing them in, even with assume_unique=True, gives a time of:
5.98 s ± 75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
while using Python set difference instead of numpy arrays gets a time of:
7.5 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and finally, combining Python sets with list comprehensions, similar to your current implementation, never finished running.
The main bottleneck is in comparing the sets of user ids, so doing that in either numpy or pandas instead of iterating over Python objects will improve performance.
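To illustrate the difference on a toy example (the arrays below are made-up stand-ins for the user-id subsets):

import numpy as np

users_before = np.array([1, 2, 3, 4, 5])  # active a month ago
users_now = np.array([2, 3, 5])           # still active today

# ids present a month ago but gone today, computed in C rather than a Python loop
lost = np.setdiff1d(users_before, users_now)      # array([1, 4])
churn = len(lost) / len(np.unique(users_before))  # 2 / 5 = 0.4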
On a different note, your code looks good and is really easy to read, but you should use snake_case instead of camelCase to fit Python convention.
- I get this error when I try to apply your code: TypeError: Cannot compare type 'Timestamp' with type 'str' – Tasos
- @Tasos I did find a typo, but that was unrelated. Copy-pasting the code from here and running it also produced no errors. Do you know what line you're getting that error on? – mochi
- I think it is this line: return df[(date >= df.StartingDate) & (date <= df.EndingDate)].UserId ... I renamed the columns based on my data. – Tasos
- @Tasos Can you check whether output.date, data.StartingDate and data.EndingDate are all datetime objects? What parts did you edit exactly? – mochi
- You are right! My original dataframe didn't have datetime column types. – Tasos
Here is an alternative way of calculating the churn rates that would be optimal efficiency-wise, without altering the data you have. Note that pandas queries may outperform the plain Python loops used here, so your solution could still run faster even though it is algorithmically less efficient.
Also, I am assuming that no user has multiple subscriptions on the same day (a user's subscription intervals don't overlap) and that each user's StartingDate to EndingDate range spans at least 1 month.
- For every row in the csv, generate 2 events: subscribed with date = start date, and unsubscribed with date = end date
- Sort all these events by date in ascending order
- Set currentDate = event[0].date, usersSubscribedByDay = {currentDate: 0} and usersUnsubscribedByDay = {currentDate: 0}
Preprocess the data: keep a running total of unsubscriptions up to each day and the number of actively subscribed users on each day:
for event in events:
    while event.date != currentDate:  # we reached a new day (check only date, not datetime)
        usersSubscribedByDay[currentDate + 1 day] = usersSubscribedByDay[currentDate]
        usersUnsubscribedByDay[currentDate + 1 day] = usersUnsubscribedByDay[currentDate]
        currentDate = currentDate + 1 day
    if event.type == 'subscribed':
        usersSubscribedByDay[currentDate] += 1
    else:
        usersUnsubscribedByDay[currentDate] += 1
        usersSubscribedByDay[currentDate] -= 1
Calculate the churn rate:
for day in sorted(usersSubscribedByDay.keys()):
    if day is on first month:
        churn.append(0)
    else:
        # number of users that were subscribed one month before this day
        subscribers = usersSubscribedByDay[day - 1 month]
        # unsubscriptions during that month; here we are assuming the
        # unsubscriptions are not of a subscription that started within the last month
        lostSubscribers = usersUnsubscribedByDay[day] - usersUnsubscribedByDay[day - 1 month]
        churn.append(lostSubscribers / subscribers)
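For concreteness, here is a minimal, untested pandas sketch of the same event-counting idea (the variable names are mine, and, following the pseudocode above, a user counts as unsubscribed on their EndingDate itself):

import pandas as pd

df = pd.read_csv("subscriptions.csv", parse_dates=["StartingDate", "EndingDate"])

# bucket subscribe/unsubscribe events by calendar day
starts = df["StartingDate"].dt.normalize().value_counts()
ends = df["EndingDate"].dt.normalize().value_counts()

days = pd.date_range(starts.index.min(), pd.Timestamp.now().normalize())
subscribed = starts.reindex(days, fill_value=0)
unsubscribed = ends.reindex(days, fill_value=0)

# running totals: active subscriptions and cumulative unsubscriptions per day
active_by_day = (subscribed - unsubscribed).cumsum()
unsubscribed_by_day = unsubscribed.cumsum()

churn = []
for day in days:
    month_ago = day - pd.DateOffset(months=1)
    if month_ago < days[0]:
        churn.append(0)  # no churn rate during the first month
    else:
        subscribers = active_by_day.loc[month_ago]
        lost = unsubscribed_by_day.loc[day] - unsubscribed_by_day.loc[month_ago]
        churn.append(lost / subscribers if subscribers else 0)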