I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:
Index x y
1 1 ec, us, us, gbr, lst
2 5 ec, us, us, us, us, ec, ec, ec, ec
3 8 ec, us, us, gbr, lst, lst, lst, lst, gbr
4 5 ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
5 7 chn, chn, chn, ec, ec, us, us, gbr, lst
I need to eliminate all the duplicate items and get a resulting dataframe like this:
Index x y
1 1 ec, us, gbr, lst
2 5 ec, us
3 8 ec, us, gbr, lst
4 5 ec, us, ir
5 7 chn, ec, us, gbr, lst
Thanks!!
-
So, what did you already try out in order to get the result you want? – 1313e, Jan 4, 2018 at 4:30
-
stackoverflow.com/questions/7794208/… There are multiple functions there; you just need to apply one of them to your dataframe. – BENY, Jan 4, 2018 at 4:55
4 Answers
Split, apply set, and join, i.e.:
df['y'].str.split(', ').apply(set).str.join(', ')
0 us, ec, gbr, lst
1 us, ec
2 us, ec, gbr, lst
3 us, ec, ir
4 us, lst, ec, gbr, chn
Name: y, dtype: object
Update based on comment :
df['y'].str.replace(r'nan|[{}\s]', '', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(r',{2,}', ',', regex=True)
# Replace all the braces and nan with `''`, then split and apply set and join
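A regex-free alternative for the {ec, us, ..., nan} case from the comments can be sketched in plain Python. This is only a sketch: clean_cell and its drop parameter are hypothetical names, and it assumes every cell is a string whose missing values appear as the literal text nan. Unlike set, it keeps the first occurrence of each item in its original position.

```python
def clean_cell(cell, drop=("nan", "")):
    """Strip braces, drop unwanted tokens, and remove duplicates in first-seen order."""
    seen = set()
    kept = []
    # Remove surrounding whitespace and braces, then split on commas
    for token in cell.strip().strip("{}").split(","):
        token = token.strip()
        # Skip placeholder tokens (e.g. "nan") and anything already kept
        if token in drop or token in seen:
            continue
        seen.add(token)
        kept.append(token)
    return ", ".join(kept)


print(clean_cell("{ec, us, us, gbr, lst, nan, nan}"))  # ec, us, gbr, lst
```

It should be usable on the column with df['y'] = df['y'].apply(clean_cell). Because tokens are compared after trimming, no stray commas are left behind when nan entries are dropped.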
-
It works perfectly, @Dark... but I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? – PAstudilloE, Jan 4, 2018 at 5:11
-
@PAstudilloE are you saying the y column is like {ec,us.. before running this code or after running this code? – Bharath M Shetty, Jan 4, 2018 at 5:15
-
Before running the code. The original columns are {ec, us, ..., nan} @Dark – PAstudilloE, Jan 4, 2018 at 5:18
-
It works well. The only problem that I have now is that the results I'm getting are like this: , , us, ec... (the nan's are erased but the commas are still there). Do you have any guidance on how to solve that? – PAstudilloE, Jan 4, 2018 at 5:49
-
To avoid the FutureWarning, add regex=True in replace. – kağan hazal koçdemir, Sep 14, 2021 at 19:56
Try this:
d['y'] = d['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))
-
It works perfectly!... but I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? – PAstudilloE, Jan 4, 2018 at 4:43
If you don't care about item order, and assuming the data type of everything in column y is a string, you can use the following snippet:
df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))
The set() conversion is what removes duplicates. Note that sets do not preserve insertion order in any Python version; it is dict that keeps insertion order (an implementation detail in CPython 3.6, guaranteed by the language since 3.7), so dict.fromkeys() is the safer choice if order matters.
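If the original order does matter, as in the expected output in the question, a minimal sketch using dict.fromkeys instead of set could look like this. The helper name dedupe_keep_order is made up for illustration, and it assumes each cell is a string separated by a comma followed by a space.

```python
def dedupe_keep_order(cell, sep=", "):
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7),
    # so the first occurrence of each item stays in place
    return sep.join(dict.fromkeys(cell.split(sep)))


rows = [
    "ec, us, us, gbr, lst",
    "ec, ec, ec, us, us, ir, us, ec, ir, ec, ec",
    "chn, chn, chn, ec, ec, us, us, gbr, lst",
]
for row in rows:
    print(dedupe_keep_order(row))
# ec, us, gbr, lst
# ec, us, ir
# chn, ec, us, gbr, lst
```

On the dataframe this would be applied as df['y'] = df['y'].apply(dedupe_keep_order).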
-
That call to list isn't needed. – Turn, Jan 4, 2018 at 4:41
-
I forgot to mention that every value in the [y] column looks like this: {ec, us, us, gbr, lst, nan, nan}. I need to erase the {} and the nan. Do you know how to do it? – PAstudilloE, Jan 4, 2018 at 4:42
-
Even in Python 3.10, sets are documented as unordered collections, so they should not be used if the order in which items are inserted or enumerated is important to a program. – Peter O., Jan 2, 2022 at 17:52
Use the apply method on the dataframe.
# change this function according to your needs
def dedup(row):
    # split the string first; set(row.y) alone would deduplicate single characters
    return list(set(row.y.split(', ')))
df['deduped'] = df.apply(dedup, axis=1)