1

How can I perform the following operations on a pandas dataframe?

  1. combine text from one column, multiple rows into one row
  2. remove duplicates in the "one row"
  3. repeat 1 & 2 for multiple columns

Based on the following Stack Overflow questions and answers, I have made the attempted code below. The last attempt is close, but I do not know how to convert the set back into a string (i.e., remove the braces) and roll it into a lambda function that I can use applymap() for multiple columns.

  1. How to combine multiple rows into a single row with pandas [duplicate]
  2. Concatenate strings from several rows using Pandas groupby
  3. Remove duplicates from rows and columns (cell) in a dataframe, python

Example Dataframe

id = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4]
colA = ['type12', 'type11', 'type11', 'type11', 'type21', 'type21', 
 'type22', 'type23', 'type23', 'type23', 'type31', 'type31', 
 'type31', 'type31', 'type41', 'type41', 'type42', 'type41', 
 'type41', 'type43'
 ]
colB = ['Set A', 'Set B', 'Set B', 'Set B', 'Set B', 'Set B', 'Set A', 
 'Set B', 'Set C', 'Set C', 'Set B', 'Set C', 'Set B', 'Set C', 
 'Set B', 'Set B', 'Set A', 'Set C', 'Set B', 'Set A'
 ]
colC = ['alpha', 'beta', 'delta', 'charlie', 'beta', 'delta', 'alpha', 
 'charlie', 'charlie', 'delta', 'delta', 'charlie', 'beta', 
 'delta', 'beta', 'charlie', 'alpha', 'charlie', 'delta', 'alpha'
 ]
df = pd.DataFrame(list(zip(id, colA, colB, colC)), columns =['id', 'colA', 'colB', 'colC'])
print(df)
 id colA colB colC
0 1 type12 Set A alpha
1 1 type11 Set B beta
2 1 type11 Set B delta
3 1 type11 Set B charlie
4 2 type21 Set B beta
5 2 type21 Set B delta
6 2 type22 Set A alpha
7 2 type23 Set B charlie
8 2 type23 Set C charlie
9 2 type23 Set C delta
10 3 type31 Set B delta
11 3 type31 Set C charlie
12 3 type31 Set B beta
13 3 type31 Set C delta
14 4 type41 Set B beta
15 4 type41 Set B charlie
16 4 type42 Set A alpha
17 4 type41 Set C charlie
18 4 type41 Set B delta
19 4 type43 Set A alpha

Desired Output

id colA colB colC
1 type11 Set B beta, delta, charlie
1 type12 Set A alpha
2 type21 Set B beta, delta
2 type22 Set A alpha
2 type23 Set B, Set C charlie, delta
3 type31 Set B, Set C beta, delta, charlie
4 type41 Set B, Set C beta, delta, charlie
4 type42 Set A alpha
4 type43 Set A alpha

Attempt for one column only that returns a string but still has duplicates

df2 = df.groupby(['id', 'colA'])['colB'].apply(', '.join).reset_index()
print(df2)
 id colA colB
0 1 type11 Set B, Set B, Set B
1 1 type12 Set A
2 2 type21 Set B, Set B
3 2 type22 Set A
4 2 type23 Set B, Set C, Set C
5 3 type31 Set B, Set C, Set B, Set C
6 4 type41 Set B, Set B, Set C, Set B
7 4 type42 Set A
8 4 type43 Set A

Attempt for one column only that removes duplicates but returns a set

df2 = df.groupby(['id', 'colA'])['colB'].apply(list).apply(set).reset_index()
print(df2)
 id colA colB
0 1 type11 {Set B}
1 1 type12 {Set A}
2 2 type21 {Set B}
3 2 type22 {Set A}
4 2 type23 {Set B, Set C}
5 3 type31 {Set B, Set C}
6 4 type41 {Set B, Set C}
7 4 type42 {Set A}
8 4 type43 {Set A}
asked Mar 29, 2021 at 19:11
0

2 Answers 2

1

You are close, just further apply list() and join() as follows:

df2 = df.groupby(['id', 'colA']).apply(list).apply(set).apply(list).apply(', '.join).reset_index()

If you want to use lambda function, use it with agg() as follows:

df2 = df.groupby(['id', 'colA']).agg(lambda x: ', '.join(list(set(list(x))))).reset_index()
print(df2)
 id colA colB colC
0 1 type11 Set B delta, charlie, beta
1 1 type12 Set A alpha
2 2 type21 Set B delta, beta
3 2 type22 Set A alpha
4 2 type23 Set B, Set C delta, charlie
5 3 type31 Set B, Set C delta, charlie, beta
6 4 type41 Set B, Set C delta, charlie, beta
7 4 type42 Set A alpha
8 4 type43 Set A alpha
answered Mar 29, 2021 at 19:29

Comments

0

You can use a lambda in your groupby. drop_duplicates on the Series within the group then join the string. agg will work on all columns that aren't your grouping columns, or specify a subset.

df.groupby(['id', 'colA']).agg(lambda x: ', '.join(x.drop_duplicates())).reset_index()

 id colA colB colC
0 1 type11 Set B beta, delta, charlie
1 1 type12 Set A alpha
2 2 type21 Set B beta, delta
3 2 type22 Set A alpha
4 2 type23 Set B, Set C charlie, delta
5 3 type31 Set B, Set C delta, charlie, beta
6 4 type41 Set B, Set C beta, charlie, delta
7 4 type42 Set A alpha
8 4 type43 Set A alpha
answered Mar 29, 2021 at 19:29

Comments

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.