Remove duplicates from rows and columns (cell) in a dataframe, python

Question 1

I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:

Index x y 
 1 1 ec, us, us, gbr, lst
 2 5 ec, us, us, us, us, ec, ec, ec, ec
 3 8 ec, us, us, gbr, lst, lst, lst, lst, gbr
 4 5 ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
 5 7 chn, chn, chn, ec, ec, us, us, gbr, lst

I need to eliminate all the duplicate items and get a resulting dataframe like this:

Index x y 
 1 1 ec, us, gbr, lst
 2 5 ec, us
 3 8 ec, us, gbr, lst
 4 5 ec, us, ir
 5 7 chn, ec, us, gbr, lst

Thanks!!

Question 2

So, what did you already try out in order to get the result you want?

Question 3

stackoverflow.com/questions/7794208/… has multiple functions for this; you just need to apply them to your dataframe.

Question 4

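Split each cell, apply set, and join the items back together (a set is unordered, so the items may come out in a different order), i.e.: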

df['y'].str.split(', ').apply(set).str.join(', ')
0 us, ec, gbr, lst
1 us, ec
2 us, ec, gbr, lst
3 us, ec, ir
4 us, lst, ec, gbr, chn
Name: y, dtype: object

Update based on the comments:

df['y'].str.replace(r'nan|[{}\s]', '', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(r',{2,}', ',', regex=True)
# Remove the braces, whitespace and `nan`, then split, apply set and join;
# finally strip leading/trailing commas and collapse repeated commas into one
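To make the chained one-liner easier to follow, here is a rough walkthrough of the same steps on a single cell, using plain `re` instead of the pandas `.str` accessor. The sample string is taken from the comments; the variable names are only for illustration.

```python
import re

raw = "{ec, us, us, gbr, lst, nan, nan}"

# Step 1: delete 'nan', braces and whitespace
step1 = re.sub(r"nan|[{}\s]", "", raw)

# Step 2: split on commas; the removed 'nan' entries leave empty strings behind
step2 = step1.split(",")

# Step 3: a set drops duplicates (the repeated empty strings collapse to one)
step3 = set(step2)

# Step 4: join, trim commas at the ends, collapse doubled commas in the middle
joined = ",".join(step3).strip(",")
result = re.sub(r",{2,}", ",", joined)

print(step1)
print(result)  # item order varies because sets are unordered
```

Note that the pattern `nan` also matches inside longer tokens, so this approach assumes no real item contains those three letters.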

Question 5

It works perfectly, @Dark, but I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to remove the {} and the nan. Do you know how to do it?

Question 6

@PAstudilloE are you saying the y column is like {ec,us.. before running this code or after running this code?

Question 7

Before running the code. The original column values are {ec, us, ..., nan} @Dark

Question 8

It works well. The only problem I have now is that the results look like this: , , us, ec... (the nan's are erased but the commas are still there). Do you have any guidance on how to solve that?
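One way to avoid the leftover commas is to filter out empty fragments and `nan` tokens before joining, rather than cleaning up the string afterwards. The following is a minimal sketch, assuming every cell is a string such as `{ec, us, nan}`; `clean_cell` is a hypothetical helper name, not part of pandas.

```python
def clean_cell(cell):
    """Turn '{ec, us, us, nan}' into 'ec, us' (hypothetical helper)."""
    seen = {}
    # strip the surrounding braces/spaces, then look at each comma-separated item
    for item in cell.strip("{} ").split(","):
        item = item.strip()
        # skip empty fragments and the literal string 'nan'
        if item and item != "nan":
            seen[item] = None  # dict keys are unique and keep first-seen order
    return ", ".join(seen)


print(clean_cell("{ec, us, us, gbr, lst, nan, nan}"))
```

Assuming a dataframe `df` as in the question, it could be applied with `df['y'] = df['y'].apply(clean_cell)`. Real missing values (float NaN rather than the text `nan`) would need separate handling.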

Question 9

To get rid of the FutureWarning, add regex=True to replace.

Question 10

Try this:

d['y'] = d['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))

Question 11

It works perfectly, but I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to remove the {} and the nan. Do you know how to do it?

Question 12

If you don't care about item order, and assuming the data type of everything in column y is a string, you can use the following snippet:

df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))

The set() conversion is what removes duplicates. Note that sets do not preserve insertion order in any Python version; it is dicts that keep insertion order (guaranteed by the language since Python 3.7, and a CPython implementation detail in 3.6).

Question 13

That call to list isn't needed.

Question 14

I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to remove the {} and the nan. Do you know how to do it?

Question 15

Even in Python 3.10, sets are documented as unordered collections, so they should not be used if the order in which items are inserted or enumerated is important to a program.
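If the original item order matters, one common alternative to `set` is `dict.fromkeys`, since dict keys are unique and keep insertion order. This is a small sketch assuming each cell is a string with items separated by `', '`; `dedupe_keep_order` is a hypothetical name.

```python
def dedupe_keep_order(cell, sep=", "):
    """Drop repeated items from a delimited string, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (guaranteed since Python 3.7)
    return sep.join(dict.fromkeys(cell.split(sep)))


print(dedupe_keep_order("chn, chn, chn, ec, ec, us, us, gbr, lst"))
# chn, ec, us, gbr, lst
```

With a dataframe `df` as in the question, this could be used as `df['y'] = df['y'].apply(dedupe_keep_order)`.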

Question 16

Use the apply method on the dataframe:

# change this function according to your needs
def dedup(row):
    # split the cell into items first; set() on the raw string would dedupe characters
    return list(set(row.y.split(', ')))
df['deduped'] = df.apply(dedup, axis=1)

score 21 · Accepted Answer · 2018-01-04 04:34:49Z


Answered Jan 4, 2018 at 4:34 by Bharath M Shetty; edited Sep 15, 2021 at 9:40 by kağan hazal koçdemir.

It works perfectly, @Dark, but I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to remove the {} and the nan. Do you know how to do it?
– PAstudilloE, Jan 4, 2018 at 5:11

@PAstudilloE are you saying the y column is like {ec,us.. before running this code or after running this code?
– Bharath M Shetty, Jan 4, 2018 at 5:15

Before running the code. The original column values are {ec, us, ..., nan} @Dark
– PAstudilloE, Jan 4, 2018 at 5:18

It works well. The only problem I have now is that the results look like this: , , us, ec... (the nan's are erased but the commas are still there). Do you have any guidance on how to solve that?
– PAstudilloE, Jan 4, 2018 at 5:49

To get rid of the FutureWarning, add regex=True to replace.
– kağan hazal koçdemir, Sep 14, 2021 at 19:56
