\$\begingroup\$

I would do this in SQL using string_agg, but the server is SQL Server 2012 and is beyond my control, so I'm trying a Python approach.

I have a dataframe of shape [20225 rows x 7 columns], and there is a bit of transformation required. Some rows are duplicates, but only in one column (the name). So what I want to do is find the duplicate rows (where the name is the same) and then

  1. Concatenate all the email addresses from the three email columns, across all rows with the same name, into one string (dropping nulls)
  2. Concatenate all the company names from the three company columns, across all rows with the same name, into one string (dropping nulls)
  3. Create a new dataframe of shape [20106 rows x 3 columns] that then has one row per name, with a single string of email addresses in the second column, and a single string of companies in the third column.

Basically, the duplicate rows have been eliminated, and the different email addresses/companynames have been concatenated.

My code works, and takes about 6 minutes to run... I don't know enough about this, but I have a hunch it could be a lot faster. I'm just looking for some pointers as to maybe structuring it differently? Thanks for any guidance.

EXAMPLE DATA

Name People1.Email People1.CompanyName People2.Email People2.CompanyName People3.Email People3.CompanyName
Person A [email protected] CompanyName [email protected] CompanyName
Person A [email protected] CompanyName [email protected] CompanyName
Person B [email protected] CompanyName [email protected] CompanyName
Person C [email protected] CompanyName [email protected] CompanyName [email protected] CompanyName
Person D [email protected] CompanyName
Person D [email protected] CompanyName
Person D [email protected] CompanyName [email protected] CompanyName
Person E [email protected] CompanyName [email protected] CompanyName [email protected] CompanyName
Person E [email protected] CompanyName [email protected] CompanyName [email protected] CompanyName
DESIRED OUTPUT

Name Emails Companies
Person A [email protected];[email protected];[email protected];[email protected] CompanyName; CompanyName;CompanyName; CompanyName
Person B [email protected];[email protected] CompanyName; CompanyName;CompanyName
etc
*DATA TYPES*
Name object
People1.Email object
People1.CompanyName object
People2.Email object
People2.CompanyName object
People3.Email object
People3.CompanyName object
*CODE*
import time
import pandas as pd

print(time.strftime("%H:%M:%S", time.localtime()) + " start")
pd_xl_file = pd.ExcelFile(r'C:\sample.xlsx')
df = pd_xl_file.parse(0)
listOfPeople = df['Name'].unique().tolist()
# Now create a new df to hold the final result
df_new = pd.DataFrame()
for person in listOfPeople:
    # Collect the unique company names from all three columns for this person
    lstCompanies = (
        df.loc[df['Name'] == person, 'People1.CompanyName'].unique().tolist() +
        df.loc[df['Name'] == person, 'People2.CompanyName'].unique().tolist() +
        df.loc[df['Name'] == person, 'People3.CompanyName'].unique().tolist()
    )
    Companies = [x for x in lstCompanies if not pd.isnull(x)]
    # Collect the unique email addresses from all three columns for this person
    lstEmails = (
        df.loc[df['Name'] == person, 'People1.Email'].unique().tolist() +
        df.loc[df['Name'] == person, 'People2.Email'].unique().tolist() +
        df.loc[df['Name'] == person, 'People3.Email'].unique().tolist()
    )
    Emails = [x for x in lstEmails if not pd.isnull(x)]
    # join the values into single strings
    c = ' '.join(Companies)
    e = ' '.join(Emails)

    # add the new row to the final result
    new_row = pd.DataFrame({'Name': person, 'Companies': c, 'Emails': e}, index=[0])
    df_new = pd.concat([new_row, df_new.loc[:]]).reset_index(drop=True)
    print('.', end='')
print(df_new)
print(time.strftime("%H:%M:%S", time.localtime()) + " end")
asked Jul 14, 2022 at 21:37
\$\endgroup\$

1 Answer

\$\begingroup\$

Don't loop with for ... in listOfPeople, and don't call tolist().

Your data are misshapen. There should not be multiple Email and CompanyName columns; there should only be one each.

Group by the name, and then aggregate using a string join.
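As a minimal sketch of that grouping step on its own, assuming the data is already in long form with a single Email and a single Company column (the frame name long_df and the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical long-form data: one row per (person, contact) pair.
long_df = pd.DataFrame({
    'Name': ['Person A', 'Person A', 'Person B'],
    'Email': ['a1@example.com', 'a2@example.com', 'b1@example.com'],
    'Company': ['Acme', 'Globex', 'Initech'],
})

# Group by name and join each group's strings with a semicolon.
combined = long_df.groupby('Name').agg({'Email': ';'.join, 'Company': ';'.join})
print(combined)
```

The aggregation runs once per group rather than filtering the whole frame once per person, which is where the loop in the question spends its time.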

Suggested

import pandas as pd

df = pd.read_csv('278083.csv', index_col='Name')

# Reshape: stack each PeopleN.Email / PeopleN.CompanyName pair
# into a single Email column and a single Company column
to_concat = []
for i in range(1, df.shape[1]//2 + 1):
    email = f'People{i}.Email'
    company = f'People{i}.CompanyName'
    sub = (
        df[[email, company]]
        .dropna()
        .rename({
            email: 'Email',
            company: 'Company'
        }, axis='columns')
    )
    sub['Contact'] = i  # remember which column pair the row came from
    to_concat.append(sub)

df = pd.concat(to_concat).set_index(keys='Contact', append=True)

# One row per name, with the strings joined by semicolons
join = ';'.join
combined = df.groupby('Name').agg({
    'Email': join, 'Company': join,
})
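To try the approach without the CSV file, here is a rough self-contained variation on a small in-memory frame. The sample values and the names wide, parts, long_df and result are assumptions for illustration. The drop_duplicates call is an addition that is not in the suggested code: it removes repeated (Name, Email, Company) rows, which is close to, but not the same as, the per-column unique() calls in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-form sample; NaN marks a missing contact.
wide = pd.DataFrame({
    'Name': ['Person A', 'Person A', 'Person B'],
    'People1.Email': ['a1@example.com', 'a1@example.com', 'b1@example.com'],
    'People1.CompanyName': ['Acme', 'Acme', 'Initech'],
    'People2.Email': ['a2@example.com', np.nan, np.nan],
    'People2.CompanyName': ['Globex', np.nan, np.nan],
}).set_index('Name')

# Stack each Email/CompanyName pair into shared Email and Company columns.
parts = []
for i in range(1, wide.shape[1] // 2 + 1):
    cols = {f'People{i}.Email': 'Email', f'People{i}.CompanyName': 'Company'}
    parts.append(wide[list(cols)].dropna().rename(columns=cols))

# Assumption: identical (Name, Email, Company) rows should collapse to one.
long_df = pd.concat(parts).reset_index().drop_duplicates()

result = long_df.groupby('Name').agg({'Email': ';'.join, 'Company': ';'.join})
print(result)
```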
answered Jul 15, 2022 at 0:24
\$\endgroup\$
  • \$\begingroup\$ Thanks, but the misshapen data is my reality. This is people data, right? I have a single "person" described with more than one email and more than one company name, in fact three of each. So I am trying to merge them into a searchable string. \$\endgroup\$ Commented Jul 15, 2022 at 1:35
  • \$\begingroup\$ The suggested code handles this by reshaping to a single email column with multiple values. If you can't control the format of the data, you should process it into this form. \$\endgroup\$ Commented Jul 15, 2022 at 2:22
  • \$\begingroup\$ Thanks. After revisiting the creation of the data, I was able to sort it out. Beautiful solution ... does the job in under 10 sec. Much appreciated. If you have the time, what does this syntax mean: df.shape[1]//2 + 1? \$\endgroup\$ Commented Jul 15, 2022 at 3:26
  • \$\begingroup\$ Take the number of columns and floor divide by two \$\endgroup\$ Commented Jul 15, 2022 at 11:15
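
To make the arithmetic concrete, a small sketch assuming six data columns (three Email/CompanyName pairs, with Name used as the index and therefore not counted). The variable names are illustrative only. The + 1 is needed because range excludes its upper bound.

```python
n_columns = 6                 # what df.shape[1] would return in this assumed case
n_contacts = n_columns // 2   # floor division: 6 // 2 == 3 (and 7 // 2 is also 3)
suffixes = list(range(1, n_contacts + 1))
print(suffixes)               # [1, 2, 3], i.e. People1, People2, People3
```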
