(Somewhat) new to python and coding in general! Using python 3.7 and pandas, I'm running code to create a searchable list of people in my dataframe, and I feel like my way of telling duplicates apart is very roundabout. I would love some advice on making it simpler and more efficient.
I have a dataframe in pandas, `df2`, which has three relevant columns: 'First Name', 'Last Name' and 'People ID'. Some people will have the same name, and these can be told apart by their 'People ID'. I start by making the 'Name' column like this:
df2['Name'] = df2['First Name'] + ' ' + df2['Last Name']
Now I make a dict, `nameid`, to find out how many different People IDs are associated with each unique 'Name' string.
```
nameid = {}
for i in df2.index:
    try:
        nameid[df2.loc[i, 'Name']].append(df2.loc[i, 'People ID'])
    except:
        nameid[df2.loc[i, 'Name']] = [df2.loc[i, 'People ID']]
```
There are multiple occurrences of each person in the spreadsheet, so I use set() to keep only the unique 'People ID' values for each name.
```
for i in nameid.keys():
    nameid[i] = list(set(nameid[i]))
```
Now I create a second dict, namead, which is a "filtered" version of nameid where we've removed all reviewer names with just one ID value associated (those are fine as they are).
```
namead = {}
for i in nameid.keys():
    paceholder = ['Nothing']
    try:
        paceholder.append(nameid[i][1])
        namead[i] = nameid[i]
    except:
        pass
```
Finally, I use namead to make dupes, the list of index values of df2 where there are names belonging to different reviewers. I then pass that through df2 to add the 'People ID' to those names and ensure there is no confusion.
```
dupes = [i for i in df2.index if df2.loc[i, 'Name'] in namead.keys()]
for i in dupes:
    df2.loc[i, 'Name'] += ' ' + str(df2.loc[i, 'People ID'])
```
Whew! I feel like I added several layers of complexity here but I'm not sure where to start - help would be much appreciated!
EDIT - I'm not sure how to put an extract of the dataframe in this textbox. Clarification: each row of data has information, and I need it to be searchable by name for the end user, with a placeholder to differentiate identical names (the People ID). The resulting data looks like: "Frank Jones / 14498", "Mitin Nutil / 35589", "Maveh Kadini 1433 / 1433" (indicating that there's more than one Maveh Kadini in the data). Each person (by People ID) will appear in many different rows of data.
Comment (301_Moved_Permanently, Oct 15, 2018 at 14:17): Can you give an example dataframe of ~15/20 people and the expected result?
2 Answers
First I will comment on the coding, then I will suggest an alternative solution.
Variable naming
Ideally you want your variable names to express what the purpose of a variable is. `nameid` and `df2` are partly understandable; `namead` is not.
Exception
Never just do a bare `try-except`. Always be as specific as possible about the exception. In the creation of `nameid` you expect a `KeyError`, and in the other case an `IndexError`. Then catch exactly those.
`collections.defaultdict` or `dict.setdefault`
Instead of trying and getting the `KeyError`, it is easier to use a `collections.defaultdict`:
```
from collections import defaultdict

nameid = defaultdict(list)
for i in df2.index:
    nameid[df2.loc[i, 'Name']].append(df2.loc[i, 'People ID'])
```
would work.
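The `dict.setdefault` alternative achieves the same without exception handling or an extra import. A sketch with plain dicts standing in for `df2` (the `rows` data is made up for illustration):

```python
# setdefault returns the existing list for the key,
# or inserts and returns a fresh [] if the key is new
rows = [
    {'Name': 'John Doe', 'People ID': 1},
    {'Name': 'John Doe', 'People ID': 2},
    {'Name': 'Jane Doe', 'People ID': 1},
]

nameid = {}
for row in rows:
    nameid.setdefault(row['Name'], []).append(row['People ID'])
```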
iteration
Apart from the fact that you want to avoid iteration as much as possible when using pandas (or numpy), in Python in general there is almost always a better way than iterating over the index. In this particular case, you can use `DataFrame.iterrows`. If the column labels had been valid Python identifiers, `DataFrame.itertuples` would have been even better.
```
nameid = defaultdict(list)
for _, row in df2.iterrows():
    nameid[row['Name']].append(row['People ID'])
```
with `_` being the conventional name for a variable you don't need.
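To illustrate the `itertuples` variant: a sketch assuming the columns were first renamed to valid identifiers (`name` and `people_id` are made-up names for this example):

```python
import pandas as pd
from collections import defaultdict

# small made-up frame with identifier-friendly column names
df2 = pd.DataFrame({
    'name': ['John Doe', 'John Doe', 'Jane Doe'],
    'people_id': [1, 2, 1],
})

# itertuples yields namedtuples, so columns become attributes
nameid = defaultdict(list)
for row in df2.itertuples():
    nameid[row.name].append(row.people_id)
```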
set
later you do
```
for i in nameid.keys():
    nameid[i] = list(set(nameid[i]))
```
Why not use a set from the start, and why the conversion to a list afterwards?
```
nameid = defaultdict(set)
for _, row in df2.iterrows():
    nameid[row['Name']].add(row['People ID'])
```
does all you need.
len
the next part
```
namead = {}
for i in nameid.keys():
    paceholder = ['Nothing']
    try:
        paceholder.append(nameid[i][1])
        namead[i] = nameid[i]
    except:
        pass
```
uses list indexing (`[1]`) and expects lists with only one element to throw an exception. This can be expressed a lot more simply with a dict comprehension. Later it turns out the actual IDs are not even needed, since you only use the keys, so a set comprehension does the trick:
```
namead = {
    name
    for name, ids in nameid.items()
    if len(ids) > 1
}
```
pandas indexing
```
dupes = [i for i in df2.index if df2.loc[i, 'Name'] in namead.keys()]
for i in dupes:
    df2.loc[i, 'Name'] += ' ' + str(df2.loc[i, 'People ID'])
```
`pandas.Series` has an `isin` method, so there is no need to make an intermediate list of the keys whose name has more than one ID:
```
dupes = df2["Name"].isin(namead)
df2.loc[dupes, "Name"] += " " + df2["People ID"].astype(str)
```
works too.
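A minimal, self-contained run of this `isin` approach (the small `df2` and the `namead` set are made up for illustration):

```python
import pandas as pd

df2 = pd.DataFrame({
    'Name': ['John Doe', 'John Doe', 'Jane Doe'],
    'People ID': [1, 2, 1],
})
namead = {'John Doe'}  # names with more than one ID, built as above

# boolean mask of the rows whose name is ambiguous
dupes = df2['Name'].isin(namead)
df2.loc[dupes, 'Name'] += ' ' + df2['People ID'].astype(str)
```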
Alternative approach
You can also use `groupby.nunique`:
```
dupes = df2.groupby(['Name'])["People ID"].nunique()
```
This is a `Series` with the `Name` as index, and the number of unique `People ID`s as values:
```
Name
Jane Doe    1
John Doe    2
Name: People ID, dtype: int64
```
Then you can filter those with a value > 1, and check against the names:
```
dupes_idx = df2["Name"].isin(dupes[dupes > 1].index)
```
Now you just need to append the `People ID` to the `Name`:
```
df2.loc[dupes_idx, "Name"] += " " + df2["People ID"].astype(str)
```
This was tested on
```
data = [
    ["John", "Doe", 1],
    ["John", "Doe", 1],
    ["John", "Doe", 2],
    ["Jane", "Doe", 1],
    ["John", "Doe", 2],
    ["Jane", "Doe", 1],
    ["John", "Doe", 1],
]
df2 = pd.DataFrame(data, columns=['First Name', 'Last Name', 'People ID'])
```
resulting in:
```
  First Name Last Name  People ID        Name
0       John       Doe          1  John Doe 1
1       John       Doe          1  John Doe 1
2       John       Doe          2  John Doe 2
3       Jane       Doe          1    Jane Doe
4       John       Doe          2  John Doe 2
5       Jane       Doe          1    Jane Doe
6       John       Doe          1  John Doe 1
```
Comment (Jim Eisenberg, Oct 16, 2018 at 8:22): Thank you! All of this is so useful, and I'm implementing the `groupby.nunique()` now.
> There are multiple occurrences of each person in the spreadsheet, so I want to just have each unique instance of a different 'People ID' using set().
```
df.groupby('Name')['People ID'].apply(set)
```
> Now I create a second dict, namead, which is a "filtered" version of nameid where we've removed all reviewer names with just one ID value associated (those are fine as they are).
```
df.groupby('Name')['People ID'].apply(set).loc[df.groupby('Name')['People ID'].nunique() > 1]
```
> I then pass that through df2 to add the 'People ID' to those names and ensure there is no confusion.
I think that just doing
```
df['Name'] = df['Name'] + (' ' + df['People ID'].astype(str)) * (df.groupby('Name')['People ID'].transform('nunique') > 1)
```
will do everything you want, but I recommend just doing
```
df['Name'] = df['Name'] + ' ' + df['People ID'].astype(str)
```
That is, you should just add `People ID` to all of the `Name` column, regardless of whether the name is unique. Consistency is generally better than parsimony.
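A runnable sketch of that recommendation, with a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jane Doe', 'John Doe', 'John Doe'],
    'People ID': [1, 1, 2],
})

# append the ID to every name, unique or not
df['Name'] = df['Name'] + ' ' + df['People ID'].astype(str)
```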