7
\$\begingroup\$

The task is basically this:

I am given the following csv file with lots of duplicate email addresses

Display Name,First Name,Last Name,Phone Number,Email Address,Login Date,Registration Date
John,John,Doe,99999999,[email protected],4/20/2015 21:56,4/20/2015 21:56
John,John,DOE,99999999,[email protected],3/31/2015 14:05,3/31/2015 14:05

I need to remove duplicates based on email address with the following conditions:

  1. The row with the latest login date must be selected.
  2. The oldest registration date among the rows must be used.

I used Python/pandas to do this.

How do I optimize the for loop in this pandas script using groupby? I tried hard but I'm still banging my head against it.

import pandas as pd
df = pd.read_csv('pra.csv')
# first sort the data by Login Date since we always need the latest Login date first
# we're making a copy so as to keep the original data intact, while still being able to sort by datetime
df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
# this way latest login date appears first for each duplicate pair
df = df.sort_values(by='Login Date Copy', ascending=False)
output_df = pd.DataFrame()
# get rows for each email address and replace registration date with the oldest one
# this can probably be optimized using groupby
for email in df['Email Address'].unique():
 subdf = df.loc[df['Email Address'] == email]
 oldest_date = subdf['Registration Date Copy'].min()
 # get corresponding registration date for a particular registration date copy
 oldest_reg_date = df[df['Registration Date Copy'] == oldest_date]['Registration Date'].values[0]
 subdf['Registration Date'] = oldest_reg_date
 output_df = output_df.append(subdf)
# drop working columns
output_df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
# finally, get only the first of the duplicates and output the result
output_df = output_df.drop_duplicates(subset='Email Address', keep='first')
output_df.to_csv('~/output.csv', index=False)
asked Aug 24, 2017 at 12:14
\$\endgroup\$
0

1 Answer 1

3
\$\begingroup\$

Using pandas.DataFrame.groupby is pretty much the same as what you are doing using the first line in your for loop. So you can change:

for email in df['Email Address'].unique():
 subdf = df.loc[df['Email Address'] == email]
 ...

by

for email, subdf in df.groupby('Email Address'):
 ...

Now, to improve this loop:

  • subdf['Registration Date Copy'].min() is not what you want since it gives you the value and you are interested by its index (as you are filtering and getting values[0] on the next line). Use argmin() instead to get the index so you can directly get the desired value with subdf['Registration Date'][<variable_name_you_stored_argmin_into>]
  • You can modify a specific cell of a DataFrame using df.set_value(index, column, value), no need to use an intermediate dataframe here. Since you will drop everything but the firsts elements of each group, you can change only the ones at subdf.index[0].

This yield:

df = pd.read_csv('pra.csv')
# Sort the data by Login Date since we always need the latest
# Login date first. We're making a copy so as to keep the
# original data intact, while still being able to sort by datetime
df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
df.sort_values(by='Login Date Copy', ascending=False, inplace=True)
# Change the registration date of the first duplicate to the oldest one
for email, group in df.groupby('Email Address'):
 oldest_index = group['Registration Date Copy'].argmin()
 oldest_registration = group['Registration Date'][oldest_index]
 df.set_value(group.index[0], 'Registration Date', oldest_registration)
# Drop working columns
df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
# Finally, get only the first of the duplicates and output the result
df.drop_duplicates(subset='Email Address', keep='first', inplace=True)
df.to_csv('~/output.csv', index=False)

I also changed a bit the formating of the comments and used inplace=True instead of df = df.whatever().


Finally, you may consider wrapping that in a function so you can easily change the values of the input/output parameters:

import pandas as pd
def main(input_csv, output_csv):
 df = pd.read_csv(input_csv)
 # Sort the data by Login Date since we always need the latest
 # Login date first. We're making a copy so as to keep the
 # original data intact, while still being able to sort by datetime
 df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
 df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
 df.sort_values(by='Login Date Copy', ascending=False, inplace=True)
 # Change the registration date of the first duplicate to the oldest one
 for email, group in df.groupby('Email Address'):
 oldest_index = group['Registration Date Copy'].argmin()
 oldest_registration = group['Registration Date'][oldest_index]
 df.set_value(group.index[0], 'Registration Date', oldest_registration)
 # Drop working columns
 df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
 # Finally, get only the first of the duplicates and output the result
 df.drop_duplicates(subset='Email Address', keep='first', inplace=True)
 df.to_csv(output_csv, index=False)
if __name__ == '__main__':
 main('pra.csv', '~/output.csv')
answered Aug 25, 2017 at 8:23
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.