(Somewhat) new to python and coding in general! Using python 3.7 and pandas, I'm running code to create a searchable list of people in my dataframe, and I feel like my way of telling duplicates apart is very roundabout. I would love some advice on making it simpler and more efficient.
I have a database in pandas, 'df2', which has three relevant columns: 'First Name', 'Last Name' and 'People ID'. Some people will have the same name, and these can be told apart by their 'People ID'. I start by making the 'Name' column like so:
df2['Name'] = df2['First Name'] + ' ' + df2['Last Name']
Now I make a Dict, nameid, to find out how many different People IDs are associated to each unique 'Name' string.
nameid = {}
for i in df2.index:
    try:
        nameid[df2.loc[i, 'Name']].append(df2.loc[i, 'People ID'])
    except:
        nameid[df2.loc[i, 'Name']] = [df2.loc[i, 'People ID']]
There are multiple occurrences of each person in the spreadsheet, so I want to keep just each unique 'People ID' using set().
for i in nameid.keys():
    nameid[i] = list(set(nameid[i]))
Now I create a second dict, namead, which is a "filtered" version of nameid where we've removed all reviewer names with just one ID value associated (those are fine as they are).
namead = {}
for i in nameid.keys():
    paceholder = ['Nothing']
    try:
        paceholder.append(nameid[i][1])
        namead[i] = nameid[i]
    except:
        pass
Finally, I use namead to make dupes, the list of index values of df2 where there are names belonging to different reviewers. I then pass that through df2 to add the 'People ID' to those names and ensure there is no confusion.
dupes = [i for i in df2.index if df2.loc[i, 'Name'] in namead.keys()]
for i in dupes:
    df2.loc[i, 'Name'] += ' ' + str(df2.loc[i, 'People ID'])
Whew! I feel like I added several layers of complexity here but I'm not sure where to start - help would be much appreciated!
EDIT - I'm not sure how to put an extract of the dataframe in this textbox. Clarification: each row of data has information, and I need it to be searchable by name for the end user, with a placeholder to differentiate identical names (the People ID). The resulting data looks like: "Frank Jones / 14498", "Mitin Nutil / 35589", "Maveh Kadini 1433 / 1433" (indicating that there's more than one Maveh Kadini in the data). Each person (by People ID) will appear in many different rows of data.
\$\begingroup\$ Can you give an example dataframe of ~15/20 people and the expected result? \$\endgroup\$ – 301_Moved_Permanently, Oct 15, 2018 at 14:17
2 Answers
First I will comment on the coding, then I will suggest an alternative solution.
Variable naming
Ideally you want your variable names to express what the purpose of a variable is. nameid and df2 are partly understandable, namead is not.
Exception
Never just do a bare try-except. Always be as specific as possible about the exception. In the creation of nameid you expect a KeyError, and in the other case an IndexError. Catch those explicitly.
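As an illustration (with made-up sample rows instead of the real dataframe), the nameid loop could catch just that KeyError:

```python
rows = [('John Doe', 1), ('Jane Doe', 1), ('John Doe', 2)]

nameid = {}
for name, pid in rows:
    try:
        nameid[name].append(pid)
    except KeyError:  # only the missing-key case; a bare except would also swallow real bugs
        nameid[name] = [pid]
```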
collections.defaultdict or dict.setdefault
Instead of trying and getting the KeyError, it is easier to use
from collections import defaultdict

nameid = defaultdict(list)
for i in df2.index:
    nameid[df2.loc[i, 'Name']].append(df2.loc[i, 'People ID'])
would work
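The dict.setdefault variant named in the heading needs no import at all; a small sketch with made-up rows:

```python
rows = [('John Doe', 1), ('Jane Doe', 1), ('John Doe', 2)]

# setdefault returns the existing list for the key, or inserts [] and returns it
nameid = {}
for name, pid in rows:
    nameid.setdefault(name, []).append(pid)
```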
iteration
Apart from the fact that you want to avoid iteration as much as possible when using pandas (or numpy), in Python in general there is almost always a better way than iterating over the index. In this particular case you can use DataFrame.iterrows. If the column labels had been valid Python identifiers, DataFrame.itertuples would have been even better.
nameid = defaultdict(list)
for _, row in df2.iterrows():
    nameid[row['Name']].append(row['People ID'])
with _ being the convention for the name of a variable you don't need
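For completeness, a sketch of the itertuples route; the rename to a valid identifier is my addition, since 'People ID' contains a space:

```python
import pandas as pd
from collections import defaultdict

df2 = pd.DataFrame({'Name': ['John Doe', 'Jane Doe', 'John Doe'],
                    'People ID': [1, 1, 2]})

# itertuples only gives attribute access for labels that are valid identifiers
nameid = defaultdict(list)
for row in df2.rename(columns={'People ID': 'people_id'}).itertuples():
    nameid[row.Name].append(row.people_id)
```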
set
later you do
for i in nameid.keys():
    nameid[i] = list(set(nameid[i]))
Why not use set from the start, and why the conversion to list afterwards?
nameid = defaultdict(set)
for _, row in df2.iterrows():
    nameid[row['Name']].add(row['People ID'])
does all you need
len
the next part
namead = {}
for i in nameid.keys():
    paceholder = ['Nothing']
    try:
        paceholder.append(nameid[i][1])
        namead[i] = nameid[i]
    except:
        pass
uses list indexing ([1]) and expects lists with only one element to throw an exception. This can be expressed a lot more simply with a dict comprehension. Later it turns out the actual IDs are not even needed, since you only use the keys, so a set comprehension does the trick.
namead = {
name
for name, ids in nameid.items()
if len(ids) > 1
}
pandas indexing
dupes = [i for i in df2.index if df2.loc[i, 'Name'] in namead.keys()]
for i in dupes:
    df2.loc[i, 'Name'] += ' ' + str(df2.loc[i, 'People ID'])
A pandas.Series has an isin method, so there is no need to build an intermediate list of the index values whose name has more than one ID.
dupes = df2["Name"].isin(namead)
df2.loc[dupes, "Name"] += " " + df2["People ID"].astype(str)
works too
Alternative approach
You can also use groupby.nunique
dupes = df2.groupby(['Name'])["People ID"].nunique()
This is a Series with the Name as index and the number of unique People IDs as values. You can then filter for the names with a value > 1, and check each row's name against that.
Name
Jane Doe    1
John Doe    2
Name: People ID, dtype: int64
dupes_idx = df2["Name"].isin(dupes[dupes>1].index)
Now you just need to append the People ID to the Name
df2.loc[dupes_idx, "Name"] += " " + df2["People ID"].astype(str)
This was tested on
data = [
    ["John", "Doe", 1],
    ["John", "Doe", 1],
    ["John", "Doe", 2],
    ["Jane", "Doe", 1],
    ["John", "Doe", 2],
    ["Jane", "Doe", 1],
    ["John", "Doe", 1],
]
df2 = pd.DataFrame(data, columns=['First Name', 'Last Name', 'People ID'])
df2['Name'] = df2['First Name'] + ' ' + df2['Last Name']
  First Name Last Name  People ID        Name
0       John       Doe          1  John Doe 1
1       John       Doe          1  John Doe 1
2       John       Doe          2  John Doe 2
3       Jane       Doe          1    Jane Doe
4       John       Doe          2  John Doe 2
5       Jane       Doe          1    Jane Doe
6       John       Doe          1  John Doe 1
\$\begingroup\$ Thank you! All of this is so useful, and I'm implementing the groupby.nunique() now. \$\endgroup\$ – Jim Eisenberg, Oct 16, 2018 at 8:22
There are multiple occurences of each person in the spreadsheet, so I want to just have each unique instance of a different 'People ID' using set().
df.groupby('Name')['People ID'].apply(set)
Now I create a second dict, namead, which is a "filtered" version of nameid where we've removed all reviewer names with just one ID value associated (those are fine as they are).
df.groupby('Name')['People ID'].apply(set).loc[df.groupby('Name')['People ID'].nunique() > 1]
I then pass that through df2 to add the 'People ID' to those names and ensure there is no confusion.
I think that just doing df['Name'] = df['Name'] + (' ' + df['People ID'].astype(str)) * (df.groupby('Name')['People ID'].transform('nunique') > 1) will do everything you want, but I recommend just doing df['Name'] = df['Name'] + ' ' + df['People ID'].astype(str). That is, you should just add the People ID to all of the Name column, regardless of whether the name is unique. Consistency is generally better than parsimony.
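A runnable version of that one-liner, using groupby(...).transform('nunique') to get a boolean Series aligned with the rows (my adaptation; the sample names are made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'John Doe', 'Jane Doe'],
                   'People ID': [1, 2, 7]})

# True on every row whose name maps to more than one distinct ID
dup = df.groupby('Name')['People ID'].transform('nunique') > 1

# multiplying a string by False yields '', by True yields the string itself,
# so the suffix is only appended on the ambiguous rows
df['Name'] = df['Name'] + (' ' + df['People ID'].astype(str)) * dup
```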