Remove duplicates from csv based on conditions

Question 1

The task is basically this:

I am given the following csv file with lots of duplicate email addresses

Display Name,First Name,Last Name,Phone Number,Email Address,Login Date,Registration Date
John,John,Doe,99999999,[email protected],4/20/2015 21:56,4/20/2015 21:56
John,John,DOE,99999999,[email protected],3/31/2015 14:05,3/31/2015 14:05

I need to remove duplicates based on email address with the following conditions:

The row with the latest login date must be selected.
The oldest registration date among the rows must be used.

I used Python/pandas to do this.

How do I optimize the for loop in this pandas script using groupby? I tried hard but I'm still banging my head against it.

import pandas as pd
df = pd.read_csv('pra.csv')
# first sort the data by Login Date since we always need the latest Login date first
# we're making a copy so as to keep the original data intact, while still being able to sort by datetime
df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
# this way latest login date appears first for each duplicate pair
df = df.sort_values(by='Login Date Copy', ascending=False)
output_df = pd.DataFrame()
# get rows for each email address and replace registration date with the oldest one
# this can probably be optimized using groupby
for email in df['Email Address'].unique():
 subdf = df.loc[df['Email Address'] == email]
 oldest_date = subdf['Registration Date Copy'].min()
 # get corresponding registration date for a particular registration date copy
 oldest_reg_date = df[df['Registration Date Copy'] == oldest_date]['Registration Date'].values[0]
 subdf['Registration Date'] = oldest_reg_date
 output_df = output_df.append(subdf)
# drop working columns
output_df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
# finally, get only the first of the duplicates and output the result
output_df = output_df.drop_duplicates(subset='Email Address', keep='first')
output_df.to_csv('~/output.csv', index=False)

Question 2

Using pandas.DataFrame.groupby is pretty much the same as what you are doing using the first line in your for loop. So you can change:

for email in df['Email Address'].unique():
 subdf = df.loc[df['Email Address'] == email]
 ...

by

for email, subdf in df.groupby('Email Address'):
 ...

Now, to improve this loop:

subdf['Registration Date Copy'].min() is not what you want since it gives you the value and you are interested by its index (as you are filtering and getting values[0] on the next line). Use argmin() instead to get the index so you can directly get the desired value with subdf['Registration Date'][<variable_name_you_stored_argmin_into>]
You can modify a specific cell of a DataFrame using df.set_value(index, column, value), no need to use an intermediate dataframe here. Since you will drop everything but the firsts elements of each group, you can change only the ones at subdf.index[0].

This yield:

df = pd.read_csv('pra.csv')
# Sort the data by Login Date since we always need the latest
# Login date first. We're making a copy so as to keep the
# original data intact, while still being able to sort by datetime
df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
df.sort_values(by='Login Date Copy', ascending=False, inplace=True)
# Change the registration date of the first duplicate to the oldest one
for email, group in df.groupby('Email Address'):
 oldest_index = group['Registration Date Copy'].argmin()
 oldest_registration = group['Registration Date'][oldest_index]
 df.set_value(group.index[0], 'Registration Date', oldest_registration)
# Drop working columns
df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
# Finally, get only the first of the duplicates and output the result
df.drop_duplicates(subset='Email Address', keep='first', inplace=True)
df.to_csv('~/output.csv', index=False)

I also changed a bit the formating of the comments and used inplace=True instead of df = df.whatever().

Finally, you may consider wrapping that in a function so you can easily change the values of the input/output parameters:

import pandas as pd
def main(input_csv, output_csv):
 df = pd.read_csv(input_csv)
 # Sort the data by Login Date since we always need the latest
 # Login date first. We're making a copy so as to keep the
 # original data intact, while still being able to sort by datetime
 df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
 df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
 df.sort_values(by='Login Date Copy', ascending=False, inplace=True)
 # Change the registration date of the first duplicate to the oldest one
 for email, group in df.groupby('Email Address'):
 oldest_index = group['Registration Date Copy'].argmin()
 oldest_registration = group['Registration Date'][oldest_index]
 df.set_value(group.index[0], 'Registration Date', oldest_registration)
 # Drop working columns
 df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
 # Finally, get only the first of the duplicates and output the result
 df.drop_duplicates(subset='Email Address', keep='first', inplace=True)
 df.to_csv(output_csv, index=False)
if __name__ == '__main__':
 main('pra.csv', '~/output.csv')

score 3 · Accepted Answer · 2017-08-25 08:23:12Z

Using pandas.DataFrame.groupby is pretty much the same as what you are doing using the first line in your for loop. So you can change:

for email in df['Email Address'].unique():
 subdf = df.loc[df['Email Address'] == email]
 ...

by

for email, subdf in df.groupby('Email Address'):
 ...

Now, to improve this loop:

subdf['Registration Date Copy'].min() is not what you want since it gives you the value and you are interested by its index (as you are filtering and getting values[0] on the next line). Use argmin() instead to get the index so you can directly get the desired value with subdf['Registration Date'][<variable_name_you_stored_argmin_into>]
You can modify a specific cell of a DataFrame using df.set_value(index, column, value), no need to use an intermediate dataframe here. Since you will drop everything but the firsts elements of each group, you can change only the ones at subdf.index[0].

This yield:

df = pd.read_csv('pra.csv')
# Sort the data by Login Date since we always need the latest
# Login date first. We're making a copy so as to keep the
# original data intact, while still being able to sort by datetime
df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
df.sort_values(by='Login Date Copy', ascending=False, inplace=True)
# Change the registration date of the first duplicate to the oldest one
for email, group in df.groupby('Email Address'):
 oldest_index = group['Registration Date Copy'].argmin()
 oldest_registration = group['Registration Date'][oldest_index]
 df.set_value(group.index[0], 'Registration Date', oldest_registration)
# Drop working columns
df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
# Finally, get only the first of the duplicates and output the result
df.drop_duplicates(subset='Email Address', keep='first', inplace=True)
df.to_csv('~/output.csv', index=False)

I also changed a bit the formating of the comments and used inplace=True instead of df = df.whatever().

Finally, you may consider wrapping that in a function so you can easily change the values of the input/output parameters:

import pandas as pd
def main(input_csv, output_csv):
 df = pd.read_csv(input_csv)
 # Sort the data by Login Date since we always need the latest
 # Login date first. We're making a copy so as to keep the
 # original data intact, while still being able to sort by datetime
 df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
 df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
 df.sort_values(by='Login Date Copy', ascending=False, inplace=True)
 # Change the registration date of the first duplicate to the oldest one
 for email, group in df.groupby('Email Address'):
 oldest_index = group['Registration Date Copy'].argmin()
 oldest_registration = group['Registration Date'][oldest_index]
 df.set_value(group.index[0], 'Registration Date', oldest_registration)
 # Drop working columns
 df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
 # Finally, get only the first of the duplicates and output the result
 df.drop_duplicates(subset='Email Address', keep='first', inplace=True)
 df.to_csv(output_csv, index=False)
if __name__ == '__main__':
 main('pra.csv', '~/output.csv')

Stack Exchange Network

Remove duplicates from csv based on conditions

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Remove duplicates from csv based on conditions

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions