I would do this in SQL using string_agg but the server is SQL Server 2012 and beyond my control. So I'm trying a python approach.
I have a dataframe of shape [20225 rows x 7 columns], and a bit of transformation is required. Some rows are duplicates, but only in one column (the name). So what I want to do is find the duplicate rows (where the name is the same) and then
- Concatenate all the email addresses in three columns and name matching rows into one string (dropping nulls)
- Concatenate all the company names in three columns and name matching rows into one string (dropping nulls)
- Create a new dataframe of shape [20106 rows x 3 columns] that then has one row per name, with a single string of email addresses in the second column, and a single string of companies in the third column.
Basically, the duplicate rows have been eliminated, and the different email addresses/companynames have been concatenated.
My code works, and takes about 6 minutes to run... I don't know enough about this, but I have a hunch it could be a lot faster. I'm just looking for some pointers as to maybe structuring it differently? Thanks for any guidance.
EXAMPLE DATA
Name | People1.Email | People1.CompanyName | People2.Email | People2.CompanyName | People3.Email | People3.CompanyName |
---|---|---|---|---|---|---|
Person A | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person A | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person B | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person C | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
Person D | [email protected] | CompanyName | ||||
Person D | [email protected] | CompanyName | ||||
Person D | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person E | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
Person E | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
Name | Emails | Companies |
---|---|---|
Person A | [email protected];[email protected];[email protected];[email protected] | CompanyName;CompanyName;CompanyName;CompanyName |
Person B | [email protected];[email protected] | CompanyName;CompanyName |
etc |
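(For anyone wanting to reproduce this: a minimal stand-in for the frame above can be built like so. The email addresses and company names here are placeholders I made up, since the real values are redacted in the tables.)

```python
import numpy as np
import pandas as pd

# Placeholder stand-in for the real spreadsheet; values are invented,
# only the column layout matches the example tables above.
df = pd.DataFrame({
    'Name': ['Person A', 'Person A', 'Person B'],
    'People1.Email': ['a1@example.com', 'a3@example.com', 'b1@example.com'],
    'People1.CompanyName': ['Acme', 'Initech', 'Globex'],
    'People2.Email': ['a2@example.com', 'a4@example.com', 'b2@example.com'],
    'People2.CompanyName': ['Acme', 'Initech', 'Globex'],
    'People3.Email': [np.nan] * 3,
    'People3.CompanyName': [np.nan] * 3,
})
print(df.shape)  # (3, 7)
```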
*DATA TYPES*
Name object
People1.Email object
People1.CompanyName object
People2.Email object
People2.CompanyName object
People3.Email object
People3.CompanyName object
*CODE*
import time
import pandas as pd

print(time.strftime("%H:%M:%S", time.localtime()) + " start")
pd_xl_file = pd.ExcelFile(r'C:\sample.xlsx')
df = pd_xl_file.parse(0)
listOfPeople = df['Name'].unique().tolist()
# Now create a new df to hold the final result
df_new = pd.DataFrame()
for person in listOfPeople:
    lstCompanies = (
        df.loc[df['Name'] == person, 'People1.CompanyName'].unique().tolist() +
        df.loc[df['Name'] == person, 'People2.CompanyName'].unique().tolist() +
        df.loc[df['Name'] == person, 'People3.CompanyName'].unique().tolist()
    )
    Companies = [x for x in lstCompanies if pd.isnull(x) == False]
    lstEmails = (
        df.loc[df['Name'] == person, 'People1.Email'].unique().tolist() +
        df.loc[df['Name'] == person, 'People2.Email'].unique().tolist() +
        df.loc[df['Name'] == person, 'People3.Email'].unique().tolist()
    )
    Emails = [x for x in lstEmails if pd.isnull(x) == False]
    # Join each list into a single string
    c = ' '.join([item for item in Companies])
    e = ' '.join([item for item in Emails])
    # append to the final result
    new_row = pd.DataFrame({'Name': person, 'Companies': c, 'Emails': e}, index=[0])
    df_new = pd.concat([new_row, df_new.loc[:]]).reset_index(drop=True)
    print('.', end='')
print(df_new)
print(time.strftime("%H:%M:%S", time.localtime()) + " end")
1 Answer
Don't `for` over `listOfPeople`, and don't `tolist`.
Your data are misshapen. There should not be multiple `Email` and `CompanyName` columns; there should be only one of each.
Group by the name, and then aggregate using a string join.
Suggested
import pandas as pd

df = pd.read_csv('278083.csv', index_col='Name')

to_concat = []
for i in range(1, df.shape[1]//2 + 1):
    email = f'People{i}.Email'
    company = f'People{i}.CompanyName'
    sub = (
        df[[email, company]]
        .dropna()
        .rename({
            email: 'Email',
            company: 'Company',
        }, axis='columns')
    )
    sub['Contact'] = i
    to_concat.append(sub)

df = pd.concat(to_concat).set_index(keys='Contact', append=True)
join = ';'.join
combined = df.groupby('Name').agg({
    'Email': join, 'Company': join,
})
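As a rough sanity check, the same reshape-then-groupby approach on a small inline frame (placeholder emails and companies, since the real file isn't shown) yields one row per name:

```python
import numpy as np
import pandas as pd

# Two duplicate rows for one name; values are invented placeholders.
df = pd.DataFrame({
    'People1.Email': ['a1@x', 'a2@x'],
    'People1.CompanyName': ['Co1', 'Co2'],
    'People2.Email': ['a3@x', np.nan],
    'People2.CompanyName': ['Co3', np.nan],
}, index=pd.Index(['Person A', 'Person A'], name='Name'))

# Reshape each Email/Company pair into long form, dropping empty contacts
to_concat = []
for i in range(1, df.shape[1] // 2 + 1):
    email = f'People{i}.Email'
    company = f'People{i}.CompanyName'
    sub = (
        df[[email, company]]
        .dropna()
        .rename({email: 'Email', company: 'Company'}, axis='columns')
    )
    sub['Contact'] = i
    to_concat.append(sub)

long = pd.concat(to_concat).set_index('Contact', append=True)
join = ';'.join
combined = long.groupby('Name').agg({'Email': join, 'Company': join})
print(combined.loc['Person A', 'Email'])    # a1@x;a2@x;a3@x
print(combined.loc['Person A', 'Company'])  # Co1;Co2;Co3
```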
- Thanks, but the misshapen data is my reality. This is people data, right? I have a single "person" described with more than one email and more than one company name, in fact three of each. So I am trying to merge them into a searchable string. – Maxcot, Jul 15, 2022 at 1:35
- The suggested code handles this by reshaping to a single email column with multiple values. If you can't control the format of the data, you should process it into this form. – Reinderien, Jul 15, 2022 at 2:22
- Thanks. After revisiting the creation of the data, I was able to sort it out. Beautiful solution ... does the job in under 10 sec. Much appreciated. If you have the time, what does this syntax `df.shape[1]//2 + 1` mean? – Maxcot, Jul 15, 2022 at 3:26
- Take the number of columns and floor divide by two. – Reinderien, Jul 15, 2022 at 11:15
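Expanding on that for later readers: with `Name` as the index, the frame has six data columns, two per contact, so `df.shape[1] // 2` counts the Email/Company pairs, and the `+ 1` lets `range` reach the last one. A quick sketch:

```python
# With Name as the index there are 6 data columns: 2 per contact,
# so floor-dividing by 2 gives the number of Email/Company pairs.
n_cols = 6
contacts = list(range(1, n_cols // 2 + 1))
print(contacts)  # [1, 2, 3]
```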