I would do this in SQL using string_agg but the server is SQL Server 2012 and beyond my control. So I'm trying a python approach.
I have a dataframe of shape [20225 rows x 7 columns], and a bit of transformation is required. Some rows are duplicates, but only in one column (the name). So what I want to do is find the duplicate rows (where the name is the same) and then
- Concatenate all the email addresses in three columns and name matching rows into one string (dropping nulls)
- Concatenate all the company names in three columns and name matching rows into one string (dropping nulls)
- Create a new dataframe of shape [20106 rows x 3 columns] that then has one row per name, with a single string of email addresses in the second column, and a single string of companies in the third column.
Basically, the duplicate rows have been eliminated, and the different email addresses/companynames have been concatenated.
My code works, and takes about 6 minutes to run... I don't know enough about this, but I have a hunch it could be a lot faster. I'm just looking for some pointers as to maybe structuring it differently? Thanks for any guidance.
EXAMPLE DATA
Name | People1.Email | People1.CompanyName | People2.Email | People2.CompanyName | People3.Email | People3.CompanyName |
---|---|---|---|---|---|---|
Person A | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person A | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person B | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person C | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
Person D | [email protected] | CompanyName | ||||
Person D | [email protected] | CompanyName | ||||
Person D | [email protected] | CompanyName | [email protected] | CompanyName | ||
Person E | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
Person E | [email protected] | CompanyName | [email protected] | CompanyName | [email protected] | CompanyName |
Name | Emails | Companies |
---|---|---|
Person A | [email protected];[email protected];[email protected];[email protected] | CompanyName;CompanyName;CompanyName;CompanyName |
Person B | [email protected];[email protected] | CompanyName;CompanyName |
etc |
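(For anyone wanting to reproduce this: a minimal stand-in for the frame above can be built like so. The email addresses and company names here are placeholders I made up, since the real values are redacted in the tables.)

```python
import numpy as np
import pandas as pd

# Placeholder stand-in for the real spreadsheet; values are invented,
# only the column layout matches the example tables above.
df = pd.DataFrame({
    'Name': ['Person A', 'Person A', 'Person B'],
    'People1.Email': ['a1@example.com', 'a3@example.com', 'b1@example.com'],
    'People1.CompanyName': ['Acme', 'Initech', 'Globex'],
    'People2.Email': ['a2@example.com', 'a4@example.com', 'b2@example.com'],
    'People2.CompanyName': ['Acme', 'Initech', 'Globex'],
    'People3.Email': [np.nan] * 3,
    'People3.CompanyName': [np.nan] * 3,
})
print(df.shape)  # (3, 7)
```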
*DATA TYPES*
Name object
People1.Email object
People1.CompanyName object
People2.Email object
People2.CompanyName object
People3.Email object
People3.CompanyName object
*CODE*
import time
import pandas as pd

print(time.strftime("%H:%M:%S", time.localtime()) + " start")
pd_xl_file = pd.ExcelFile(r'C:\sample.xlsx')
df = pd_xl_file.parse(0)
listOfPeople = df['Name'].unique().tolist()
# Now create a new df to hold the final result
df_new = pd.DataFrame()
for person in listOfPeople:
    lstCompanies = (
        df.loc[df['Name'] == person, 'People1.CompanyName'].unique().tolist() +
        df.loc[df['Name'] == person, 'People2.CompanyName'].unique().tolist() +
        df.loc[df['Name'] == person, 'People3.CompanyName'].unique().tolist()
    )
    Companies = [x for x in lstCompanies if pd.isnull(x) == False]
    lstEmails = (
        df.loc[df['Name'] == person, 'People1.Email'].unique().tolist() +
        df.loc[df['Name'] == person, 'People2.Email'].unique().tolist() +
        df.loc[df['Name'] == person, 'People3.Email'].unique().tolist()
    )
    Emails = [x for x in lstEmails if pd.isnull(x) == False]
    # Join each list into a single string
    c = ' '.join([item for item in Companies])
    e = ' '.join([item for item in Emails])
    # append to the final result
    new_row = pd.DataFrame({'Name': person, 'Companies': c, 'Emails': e}, index=[0])
    df_new = pd.concat([new_row, df_new.loc[:]]).reset_index(drop=True)
    print('.', end='')
print(df_new)
print(time.strftime("%H:%M:%S", time.localtime()) + " end")
1 Answer
Don't `for` over `listOfPeople`, and don't `tolist`.
Your data are misshapen. There should not be multiple `Email` and `CompanyName` columns; there should be only one of each.
Group by the name, and then aggregate using a string join.
Suggested
import pandas as pd

df = pd.read_csv('278083.csv', index_col='Name')

to_concat = []
for i in range(1, df.shape[1]//2 + 1):
    email = f'People{i}.Email'
    company = f'People{i}.CompanyName'
    sub = (
        df[[email, company]]
        .dropna()
        .rename({
            email: 'Email',
            company: 'Company',
        }, axis='columns')
    )
    sub['Contact'] = i
    to_concat.append(sub)

df = pd.concat(to_concat).set_index(keys='Contact', append=True)
join = ';'.join
combined = df.groupby('Name').agg({
    'Email': join, 'Company': join,
})
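As a rough sanity check, the same reshape-then-groupby approach on a small inline frame (placeholder emails and companies, since the real file isn't shown) yields one row per name:

```python
import numpy as np
import pandas as pd

# Two duplicate rows for one name; values are invented placeholders.
df = pd.DataFrame({
    'People1.Email': ['a1@x', 'a2@x'],
    'People1.CompanyName': ['Co1', 'Co2'],
    'People2.Email': ['a3@x', np.nan],
    'People2.CompanyName': ['Co3', np.nan],
}, index=pd.Index(['Person A', 'Person A'], name='Name'))

# Reshape each Email/Company pair into long form, dropping empty contacts
to_concat = []
for i in range(1, df.shape[1] // 2 + 1):
    email = f'People{i}.Email'
    company = f'People{i}.CompanyName'
    sub = (
        df[[email, company]]
        .dropna()
        .rename({email: 'Email', company: 'Company'}, axis='columns')
    )
    sub['Contact'] = i
    to_concat.append(sub)

long = pd.concat(to_concat).set_index('Contact', append=True)
join = ';'.join
combined = long.groupby('Name').agg({'Email': join, 'Company': join})
print(combined.loc['Person A', 'Email'])    # a1@x;a2@x;a3@x
print(combined.loc['Person A', 'Company'])  # Co1;Co2;Co3
```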
- Thanks, but the misshapen data is my reality. This is people data, right? I have a single "person" described with more than one email and more than one company name, in fact three of each. So I am trying to merge them into a searchable string. – Maxcot, Jul 15, 2022 at 1:35
- The suggested code handles this by reshaping to a single email column with multiple values. If you can't control the format of the data, you should process it into this form. – Reinderien, Jul 15, 2022 at 2:22
- Thanks. After revisiting the creation of the data, I was able to sort it out. Beautiful solution ... does the job in under 10 sec. Much appreciated. If you have the time, what does this syntax `df.shape[1]//2 + 1` mean? – Maxcot, Jul 15, 2022 at 3:26
- Take the number of columns and floor divide by two. – Reinderien, Jul 15, 2022 at 11:15
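Expanding on that for later readers: with `Name` as the index, the frame has six data columns, two per contact, so `df.shape[1] // 2` counts the Email/Company pairs, and the `+ 1` lets `range` reach the last one. A quick sketch:

```python
# With Name as the index there are 6 data columns: 2 per contact,
# so floor-dividing by 2 gives the number of Email/Company pairs.
n_cols = 6
contacts = list(range(1, n_cols // 2 + 1))
print(contacts)  # [1, 2, 3]
```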