12

I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:

Index x y 
 1 1 ec, us, us, gbr, lst
 2 5 ec, us, us, us, us, ec, ec, ec, ec
 3 8 ec, us, us, gbr, lst, lst, lst, lst, gbr
 4 5 ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
 5 7 chn, chn, chn, ec, ec, us, us, gbr, lst

I need to eliminate all the duplicate items and get a resulting dataframe like this:

Index x y 
 1 1 ec, us, gbr, lst
 2 5 ec, us
 3 8 ec, us, gbr, lst
 4 5 ec, us, ir
 5 7 chn, ec, us, gbr, lst

Thanks!!

asked Jan 4, 2018 at 4:29
2
  • So, what did you already try out in order to get the result you want? Commented Jan 4, 2018 at 4:30
  • stackoverflow.com/questions/7794208/… There are multiple functions there; you just need to apply them to your dataframe. Commented Jan 4, 2018 at 4:55

4 Answers

21

Split, apply set, and join, i.e.:

df['y'].str.split(', ').apply(set).str.join(', ')
0 us, ec, gbr, lst
1 us, ec
2 us, ec, gbr, lst
3 us, ec, ir
4 us, lst, ec, gbr, chn
Name: y, dtype: object

Update based on comment :

df['y'].str.replace(r'nan|[{}\s]', '', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(',{2,}', ',', regex=True)
# Replace all the braces, whitespace, and nan with `''`, then split, apply set, and join;
# the final strip/replace remove the stray commas left behind by the erased nan entries
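As an alternative to the regex chain, the cleanup can be sketched as a plain Python function. This is only a sketch under assumptions: each cell is a string shaped like `{ec, us, us, gbr, lst, nan, nan}` as described in the comments, `nan` appears as literal text rather than a real missing value, and the name `clean_cell` is hypothetical.

```python
def clean_cell(cell):
    """Strip braces, drop 'nan' and empty items, de-duplicate in first-seen order."""
    # Remove the surrounding braces/spaces, then split on commas and trim each item
    items = (item.strip() for item in cell.strip("{} ").split(","))
    # Keep only real items: skip empty strings and the literal text 'nan'
    kept = [item for item in items if item and item != "nan"]
    # dict.fromkeys removes duplicates while keeping insertion order (Python 3.7+)
    return ", ".join(dict.fromkeys(kept))


print(clean_cell("{ec, us, us, gbr, lst, nan, nan}"))  # ec, us, gbr, lst
```

It could then be applied with `df['y'].apply(clean_cell)`. Because it compares whole items against `nan`, an item that merely contains those letters is left intact, and no stray commas are produced.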
answered Jan 4, 2018 at 4:34
5
  • it works perfectly @Dark ... but I forgot to mention that the whole [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? Commented Jan 4, 2018 at 5:11
  • @PAstudilloE are you saying the y column is like {ec,us.. before running this code or after running this code? Commented Jan 4, 2018 at 5:15
  • before running the code. the original columns are {ec, us, ..., nan} @Dark Commented Jan 4, 2018 at 5:18
  • it works well. The only problem that I have now is that the results I'm getting are like this: , , us, ec... (the nan's are erased but the commas are still there). Do you have any guidance on how to solve that? Commented Jan 4, 2018 at 5:49
  • For FutureWarning error add regex=True in replace Commented Sep 14, 2021 at 19:56
1

Try this:

d['y'] = d['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))
answered Jan 4, 2018 at 4:37
1
  • it works perfectly!... but I forgot to mention that the whole [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? Commented Jan 4, 2018 at 4:43
1

If you don't care about item order, and assuming the data type of everything in column y is a string, you can use the following snippet:

df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))

The set() conversion is what removes duplicates. Note that Python sets are unordered, so the items may come back in a different order than they appeared; insertion-order preservation is guaranteed for dicts (since Python 3.7), not for sets.
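If the original order does matter (the expected output in the question keeps first-seen order), one possible approach is to de-duplicate through a dict instead of a set. A minimal sketch, assuming every cell is a string separated by `', '`; the helper name `dedupe_keep_order` is made up for illustration:

```python
def dedupe_keep_order(cell, sep=", "):
    """Remove duplicate items from a delimited string, keeping first-seen order."""
    # dict.fromkeys keeps one key per distinct item, in insertion order
    # (guaranteed by the language since Python 3.7)
    return sep.join(dict.fromkeys(cell.split(sep)))


print(dedupe_keep_order("ec, us, us, gbr, lst"))  # ec, us, gbr, lst
```

It can be used the same way as the snippet above, e.g. `df['y'] = df['y'].apply(dedupe_keep_order)`.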

answered Jan 4, 2018 at 4:36
3
  • That call to list isn't needed. Commented Jan 4, 2018 at 4:41
  • I forgot to mention that the whole [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? Commented Jan 4, 2018 at 4:42
  • Even in Python 3.10, sets are documented as unordered collections, so they should not be used if the order in which items are inserted or enumerated is important to a program. Commented Jan 2, 2022 at 17:52
0

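Use the apply method on the dataframe.

# change this function according to your needs
def dedup(row):
    # row.y is a string, so split it into items first;
    # set(row.y) alone would de-duplicate individual characters
    return list(set(row.y.split(', ')))
df['deduped'] = df.apply(dedup, axis=1)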
answered Jan 4, 2018 at 4:44
