(Somewhat) new to python and coding in general! Using python 3.7 and pandas, I'm running code to create a searchable list of people in my dataframe, and I feel like my way of telling duplicates apart is very roundabout. I would love some advice on making it simpler and more efficient.
I have a database in pandas, 'df2', which has three relevant columns: 'First Name', 'Last Name' and 'People ID'. Some people will have the same name, and these can be told apart by their 'People ID'. I start by making the 'Name' column like so:
df2['Name'] = df2['First Name'] + ' ' + df2['Last Name']
Now I make a Dict, nameid, to find out how many different People IDs are associated to each unique 'Name' string.
nameid = {}
for i in df2.index:
    try:
        nameid[df2.loc[i, 'Name']].append(df2.loc[i, 'People ID'])
    except:
        nameid[df2.loc[i, 'Name']] = [df2.loc[i, 'People ID']]
There are multiple occurrences of each person in the spreadsheet, so I want to keep just each unique 'People ID' using set().
for i in nameid.keys():
    nameid[i] = list(set(nameid[i]))
Now I create a second dict, namead, which is a "filtered" version of nameid where we've removed all reviewer names with just one ID value associated (those are fine as they are).
namead = {}
for i in nameid.keys():
    paceholder = ['Nothing']
    try:
        paceholder.append(nameid[i][1])
        namead[i] = nameid[i]
    except:
        pass
Finally, I use namead to make dupes, the list of index values of df2 where there are names belonging to different reviewers. I then pass that through df2 to add the 'People ID' to those names and ensure there is no confusion.
dupes = [i for i in df2.index if df2.loc[i, 'Name'] in namead.keys()]
for i in dupes:
    df2.loc[i, 'Name'] += ' ' + str(df2.loc[i, 'People ID'])
Whew! I feel like I added several layers of complexity here but I'm not sure where to start - help would be much appreciated!
EDIT - I'm not sure how to put an extract of the dataframe in this textbox. Clarification: each row of data has information, and I need it to be searchable by name for the end user, with a placeholder to differentiate identical names (the People ID). The resulting data looks like: "Frank Jones / 14498", "Mitin Nutil / 35589", "Maveh Kadini 1433 / 1433" (indicating that there's more than one Maveh Kadini in the data). Each person (by People ID) will appear in many different rows of data.
\$\begingroup\$ Can you give an example dataframe of ~15/20 people and the expected result? \$\endgroup\$ – 301_Moved_Permanently, Oct 15, 2018 at 14:17
2 Answers
First I will comment on the coding, then I will suggest an alternative solution.
Variable naming
Ideally you want your variable names to express what the purpose of a variable is. nameid and df2 are partly understandable, namead is not.
Exception
Never just do a bare try-except. Always be as specific as possible about the exception. In the creation of nameid you expect a KeyError, and in the other case an IndexError. Catch those explicitly.
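As an illustration (with made-up sample rows instead of the real dataframe), the nameid loop could catch just that KeyError:

```python
rows = [('John Doe', 1), ('Jane Doe', 1), ('John Doe', 2)]

nameid = {}
for name, pid in rows:
    try:
        nameid[name].append(pid)
    except KeyError:  # only the missing-key case; a bare except would also swallow real bugs
        nameid[name] = [pid]
```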
collections.defaultdict or dict.setdefault
Instead of trying and getting the KeyError, it is easier to use
from collections import defaultdict

nameid = defaultdict(list)
for i in df2.index:
    nameid[df2.loc[i, 'Name']].append(df2.loc[i, 'People ID'])
would work
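The dict.setdefault variant named in the heading needs no import at all; a small sketch with made-up rows:

```python
rows = [('John Doe', 1), ('Jane Doe', 1), ('John Doe', 2)]

# setdefault returns the existing list for the key, or inserts [] and returns it
nameid = {}
for name, pid in rows:
    nameid.setdefault(name, []).append(pid)
```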
iteration
Apart from the fact that you want to avoid iteration as much as possible when using pandas (or numpy), in Python in general there is almost always a better way than iterating over the index. In this particular case you can use DataFrame.iterrows. If the column labels had been valid Python identifiers, DataFrame.itertuples would have been even better.
nameid = defaultdict(list)
for _, row in df2.iterrows():
    nameid[row['Name']].append(row['People ID'])
with _ being the convention for the name of a variable you don't need
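For completeness, a sketch of the itertuples route; the rename to a valid identifier is my addition, since 'People ID' contains a space:

```python
import pandas as pd
from collections import defaultdict

df2 = pd.DataFrame({'Name': ['John Doe', 'Jane Doe', 'John Doe'],
                    'People ID': [1, 1, 2]})

# itertuples only gives attribute access for labels that are valid identifiers
nameid = defaultdict(list)
for row in df2.rename(columns={'People ID': 'people_id'}).itertuples():
    nameid[row.Name].append(row.people_id)
```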
set
later you do
for i in nameid.keys():
    nameid[i] = list(set(nameid[i]))
Why not use set from the start, and why the conversion to list afterwards?
nameid = defaultdict(set)
for _, row in df2.iterrows():
    nameid[row['Name']].add(row['People ID'])
does all you need
len
the next part
namead = {}
for i in nameid.keys():
    paceholder = ['Nothing']
    try:
        paceholder.append(nameid[i][1])
        namead[i] = nameid[i]
    except:
        pass
uses list indexing ([1]) and expects lists with only one element to throw an exception. This can be expressed a lot more simply with a dict comprehension. Later it turns out the actual IDs are not even needed, since you only use the keys, so a set comprehension does the trick.
namead = {
name
for name, ids in nameid.items()
if len(ids) > 1
}
pandas indexing
dupes = [i for i in df2.index if df2.loc[i, 'Name'] in namead.keys()]
for i in dupes:
    df2.loc[i, 'Name'] += ' ' + str(df2.loc[i, 'People ID'])
A pandas.Series has an isin method, so there is no need to build an intermediate list of the index values whose name has more than one ID.
dupes = df2["Name"].isin(namead)
df2.loc[dupes, "Name"] += " " + df2["People ID"].astype(str)
works too
Alternative approach
You can also use groupby.nunique
dupes = df2.groupby(['Name'])["People ID"].nunique()
This is a Series with the Name as index and the number of unique People IDs as values. You can then filter for the names with a value > 1, and check each row's name against that.
Name
Jane Doe    1
John Doe    2
Name: People ID, dtype: int64
dupes_idx = df2["Name"].isin(dupes[dupes>1].index)
Now you just need to append the People ID to the Name
df2.loc[dupes_idx, "Name"] += " " + df2["People ID"].astype(str)
This was tested on
data = [
    ["John", "Doe", 1],
    ["John", "Doe", 1],
    ["John", "Doe", 2],
    ["Jane", "Doe", 1],
    ["John", "Doe", 2],
    ["Jane", "Doe", 1],
    ["John", "Doe", 1],
]
df2 = pd.DataFrame(data, columns=['First Name', 'Last Name', 'People ID'])
df2['Name'] = df2['First Name'] + ' ' + df2['Last Name']
  First Name Last Name  People ID        Name
0       John       Doe          1  John Doe 1
1       John       Doe          1  John Doe 1
2       John       Doe          2  John Doe 2
3       Jane       Doe          1    Jane Doe
4       John       Doe          2  John Doe 2
5       Jane       Doe          1    Jane Doe
6       John       Doe          1  John Doe 1
\$\begingroup\$ Thank you! All of this is so useful, and I'm implementing the groupby.nunique() now. \$\endgroup\$ – Jim Eisenberg, Oct 16, 2018 at 8:22
There are multiple occurences of each person in the spreadsheet, so I want to just have each unique instance of a different 'People ID' using set().
df.groupby('Name')['People ID'].apply(set)
Now I create a second dict, namead, which is a "filtered" version of nameid where we've removed all reviewer names with just one ID value associated (those are fine as they are).
df.groupby('Name')['People ID'].apply(set).loc[df.groupby('Name')['People ID'].nunique() > 1]
I then pass that through df2 to add the 'People ID' to those names and ensure there is no confusion.
I think that just doing df['Name'] = df['Name'] + (' ' + df['People ID'].astype(str)) * (df.groupby('Name')['People ID'].transform('nunique') > 1) will do everything you want, but I recommend just doing df['Name'] = df['Name'] + ' ' + df['People ID'].astype(str). That is, you should just add the People ID to all of the Name column, regardless of whether the name is unique. Consistency is generally better than parsimony.
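A runnable version of that one-liner, using groupby(...).transform('nunique') to get a boolean Series aligned with the rows (my adaptation; the sample names are made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'John Doe', 'Jane Doe'],
                   'People ID': [1, 2, 7]})

# True on every row whose name maps to more than one distinct ID
dup = df.groupby('Name')['People ID'].transform('nunique') > 1

# multiplying a string by False yields '', by True yields the string itself,
# so the suffix is only appended on the ambiguous rows
df['Name'] = df['Name'] + (' ' + df['People ID'].astype(str)) * dup
```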