I'm looking for a method to concatenate multiple columns into a new one for duplicate rows with pandas. Here is the code I have so far:

import pandas as pd

df = pd.DataFrame([
    {"col1": 1, "col2": 2, "col3": 3, "col4": 4, "col5": 5},
    {"col1": 1, "col2": 2, "col3": 3, "col4": 6, "col5": 7},
    {"col1": 1, "col2": 2, "col3": 3, "col4": 8, "col5": 8},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 101},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 102},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 100},
])

def f(x, y, z):
    return list({x, y, z})

new_df_rows = []
# !!!
# should merge all rows sharing the same (col1, col2, col3) into one row,
# with a new field containing the set of all values from "col3", "col4" and "col5"
# the following works but I'm using a lot of hacks
# !!!
df_duplicated = df[df.duplicated(["col1", "col2", "col3"], keep=False)]
df_duplicated_groupby = df_duplicated.groupby(["col1", "col2", "col3"])
group_names = df_duplicated_groupby.groups.keys()
for group_name in group_names:
    group = df_duplicated_groupby.get_group(group_name)
    print(group_name)
    print(group)
    # flatten the per-row sets into a single deduplicated list
    new_col = list({x
                    for l in (f(row[0], row[1], row[2])
                              for row in group[["col3", "col4", "col5"]].to_numpy())
                    for x in l})
    new_df_rows.append({
        "col1": group_name[0],
        "col2": group_name[1],
        "col3": group_name[2],
        "new_col": new_col,
    })
new_df = pd.DataFrame(new_df_rows)
print(new_df.to_string())

Expected results:

   col1  col2  col3               new_col
0     1     2     3    [3, 4, 5, 6, 7, 8]
1     1     2    10   [10, 100, 101, 102]

Is there a more concise and/or faster method with pandas to achieve that?

asked Aug 24, 2022 at 12:06

2 Answers

You can do this with groupby + apply and numpy.unique:

import numpy as np

new_df = (df.groupby([df.col1, df.col2, df.col3])[['col3', 'col4', 'col5']]
            .apply(lambda x: np.unique(x).tolist())
            .rename('new_col')
            .reset_index())

Result:

   col1  col2  col3               new_col
0     1     2     3    [3, 4, 5, 6, 7, 8]
1     1     2    10   [10, 100, 101, 102]
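
As a variant, the same idea works without the numpy dependency, using a plain Python set union over each flattened group (a sketch; sorted takes over the ordering that np.unique provides above):

new_df = (df.groupby([df.col1, df.col2, df.col3])[['col3', 'col4', 'col5']]
            .apply(lambda g: sorted(set(g.to_numpy().ravel())))  # flatten, dedupe, sort
            .rename('new_col')
            .reset_index())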
answered Aug 24, 2022 at 12:48

First, create newcol with the row values as a list:

df['newcol'] = df[['col3','col4','col5']].values.tolist()

   col1  col2  col3  col4  col5          newcol
0     1     2     3     4     5       [3, 4, 5]
1     1     2     3     6     7       [3, 6, 7]
2     1     2     3     8     8       [3, 8, 8]
3     1     2    10   100   101  [10, 100, 101]
4     1     2    10   100   102  [10, 100, 102]
5     1     2    10   100   100  [10, 100, 100]

Then group by the first three columns and aggregate newcol with sum, which concatenates the lists:

newdf = df.groupby(['col1','col2','col3']).agg({'newcol': 'sum'}).reset_index()
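
This works because the 'sum' aggregation concatenates Python lists, much like the builtin does:

sum([[3, 4, 5], [3, 6, 7], [3, 8, 8]], [])  # -> [3, 4, 5, 3, 6, 7, 3, 8, 8]

Keep in mind that repeated list concatenation is quadratic in the number of rows per group, so it can get slow on very large groups.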

Finally, remove the duplicate values within each list:

newdf['newcol'] = newdf['newcol'].apply(lambda x: list(set(x)))

Output:

   col1  col2  col3              newcol
0     1     2     3  [3, 4, 5, 6, 7, 8]
1     1     2    10  [10, 100, 101, 102]
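
For reference, the three steps can be folded into a single chain; this is just a condensed sketch of the same logic, with sorted(set(...)) so the order of each list is also deterministic:

newdf = (df.assign(newcol=df[['col3', 'col4', 'col5']].values.tolist())
           .groupby(['col1', 'col2', 'col3'], as_index=False)
           .agg({'newcol': 'sum'}))
newdf['newcol'] = newdf['newcol'].apply(lambda x: sorted(set(x)))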
answered Aug 24, 2022 at 12:52

1 Comment

Simple but fantastic answer; it also handles the non-duplicate rows, which my script did separately. Thanks a lot!
