I'm looking for a method to concatenate multiple columns into a new one for duplicate rows with pandas. Here is the code I have so far:

import pandas as pd

df = pd.DataFrame([
    {"col1": 1, "col2": 2, "col3": 3, "col4": 4, "col5": 5},
    {"col1": 1, "col2": 2, "col3": 3, "col4": 6, "col5": 7},
    {"col1": 1, "col2": 2, "col3": 3, "col4": 8, "col5": 8},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 101},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 102},
    {"col1": 1, "col2": 2, "col3": 10, "col4": 100, "col5": 100},
])

def f(x, y, z):
    return list({x, y, z})

new_df_rows = []
# !!!
# should merge all rows sharing the same (col1, col2, col3) into one row,
# with a new field containing the set of all values from "col3", "col4" and "col5"
# the following works but I'm using a lot of hacks
# !!!
df_duplicated = df[df.duplicated(["col1", "col2", "col3"], keep=False)]
df_duplicated_groupby = df_duplicated.groupby(["col1", "col2", "col3"])
group_names = df_duplicated_groupby.groups.keys()
for group_name in group_names:
    group = df_duplicated_groupby.get_group(group_name)
    print(group_name)
    print(group)
    # flatten the per-row sets into a single deduplicated list
    new_col = list({x
                    for l in (f(row[0], row[1], row[2])
                              for row in group[["col3", "col4", "col5"]].to_numpy())
                    for x in l})
    new_df_rows.append({
        "col1": group_name[0],
        "col2": group_name[1],
        "col3": group_name[2],
        "new_col": new_col,
    })
new_df = pd.DataFrame(new_df_rows)
print(new_df.to_string())

Expected results:

   col1  col2  col3               new_col
0     1     2     3    [3, 4, 5, 6, 7, 8]
1     1     2    10   [10, 100, 101, 102]

Is there a more concise and/or faster method with pandas to achieve that?

asked Aug 24, 2022 at 12:06

2 Answers

You can do this with groupby + apply and numpy.unique:

import numpy as np

new_df = (df.groupby([df.col1, df.col2, df.col3])[['col3', 'col4', 'col5']]
            .apply(lambda x: np.unique(x).tolist())
            .rename('new_col')
            .reset_index())

Result:

   col1  col2  col3               new_col
0     1     2     3    [3, 4, 5, 6, 7, 8]
1     1     2    10   [10, 100, 101, 102]
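
As a variant, the same idea works without the numpy dependency, using a plain Python set union over each flattened group (a sketch; sorted takes over the ordering that np.unique provides above):

new_df = (df.groupby([df.col1, df.col2, df.col3])[['col3', 'col4', 'col5']]
            .apply(lambda g: sorted(set(g.to_numpy().ravel())))  # flatten, dedupe, sort
            .rename('new_col')
            .reset_index())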
answered Aug 24, 2022 at 12:48

First, create newcol with the row values as a list:

df['newcol'] = df[['col3','col4','col5']].values.tolist()

   col1  col2  col3  col4  col5          newcol
0     1     2     3     4     5       [3, 4, 5]
1     1     2     3     6     7       [3, 6, 7]
2     1     2     3     8     8       [3, 8, 8]
3     1     2    10   100   101  [10, 100, 101]
4     1     2    10   100   102  [10, 100, 102]
5     1     2    10   100   100  [10, 100, 100]

Then group by the first three columns and aggregate newcol with sum, which concatenates the lists:

newdf = df.groupby(['col1','col2','col3']).agg({'newcol': 'sum'}).reset_index()
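
This works because the 'sum' aggregation concatenates Python lists, much like the builtin does:

sum([[3, 4, 5], [3, 6, 7], [3, 8, 8]], [])  # -> [3, 4, 5, 3, 6, 7, 3, 8, 8]

Keep in mind that repeated list concatenation is quadratic in the number of rows per group, so it can get slow on very large groups.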

Finally, remove the duplicate values within each list:

newdf['newcol'] = newdf['newcol'].apply(lambda x: list(set(x)))

Output:

   col1  col2  col3              newcol
0     1     2     3  [3, 4, 5, 6, 7, 8]
1     1     2    10  [10, 100, 101, 102]
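
For reference, the three steps can be folded into a single chain; this is just a condensed sketch of the same logic, with sorted(set(...)) so the order of each list is also deterministic:

newdf = (df.assign(newcol=df[['col3', 'col4', 'col5']].values.tolist())
           .groupby(['col1', 'col2', 'col3'], as_index=False)
           .agg({'newcol': 'sum'}))
newdf['newcol'] = newdf['newcol'].apply(lambda x: sorted(set(x)))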
answered Aug 24, 2022 at 12:52

1 Comment

Simple but fantastic answer; it also handles the non-duplicate rows, which my script did separately. Thanks a lot!
