Question: was using get_dummies
a good choice for converting categorical strings?
I used get_dummies
to convert categorical variables into dummy / indicator variables for a cold-start recommender system. It only uses category-type information and a limited set of basic choices.
The code works and the output seems good. This is my first data science project, which is for fun; I kind of put this together from reading documentation and searching Stack Overflow. I plan on making it a hybrid recommender system soon by adding sentiment analysis and topic classification, both of which I also recently finished.
To check another character, just input a different character name in userInput. The full notebook and Excel sheet for import are on my GitHub.
I would be grateful if someone could comment on whether or not I was able to achieve the following goals (and of course if not, what I can improve):
- Code structure
- Style and readability: Is the code comprehensible?
- Are there any bad practices?
These are the attributes in my code:
Character's name (must be unique)
herotype (must be one of the following choices)
- Bard
- Sorcerer
- Paladin
- Rogue
- Druid
weapons (can have one or more of the following choices)
- Dagger
- sling
- club
- light crossbow
- battleaxe
- Greataxe
spells (can have one or more of the following choices)
- Transmutation
- Enchantment
- Necromancy
- Abjuration
- Conjuration
- Evocation
Input and Output
Get Another Recommendation
Just change the username 'Irv' to another one like 'Zed Ryley', etc.:
userInput = [
    {'name':'Irv', 'rating':1}
]
The results
come back formatted like this
name herotype weapons spells
28 Irv Sorcerer light crossbow Conjuration
9 yac Sorcerer Greataxe Conjuration, Evocation, Transmutation
18 Traubon Durthane Sorcerer light crossbow Evocation, Transmutation, Necromancy
8 wuc Sorcerer light crossbow, battleaxe Necromancy
1 niem Sorcerer light crossbow, battleaxe Necromancy
23 Zed Ryley Sorcerer sling Evocation
For comparison
Here are the scores which show how it ranks the results.
In [5]:
recommendationTable_df.head(6)
Out[5]:
28 1.000000
9 0.666667
18 0.666667
8 0.666667
1 0.666667
23 0.333333
dtype: float64
Code
#imports
import pandas as pd
import numpy as np
df = pd.read_excel('dnd-dataframe.xlsx', sheet_name=0, usecols=['name', 'weapons','herotype','spells'])
df.head(30)
dummies1 = df['weapons'].str.get_dummies(sep=',')
dummies2 = df['spells'].str.get_dummies(sep=',')
dummies3 = df['herotype'].str.get_dummies(sep=',')
genre_data = pd.concat([df, dummies1,dummies2, dummies3], axis=1)
userInput = [
    {'name':'Irv', 'rating':1} #There is no rating system being used, so by default the rating is set to 1
]
inputname = pd.DataFrame(userInput)
inputId = df[df['name'].isin(inputname['name'].tolist())]
#Then merging them so we can get the rating alongside the character's row. It's implicitly merging by name.
inputname = pd.merge(inputId, inputname)
#Dropping information we won't use from the input dataframe
inputname = inputname.drop('weapons',1).drop('spells',1).drop('herotype',1)
#Selecting the rows for the characters named in the input
username = genre_data[genre_data['name'].isin(inputname['name'].tolist())]
#Resetting the index to avoid future issues
username = username.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userGenreTable = username.drop('name',1).drop('weapons',1).drop('spells',1).drop('herotype',1)
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputname['rating'])
genreTable = genre_data.copy()
genreTable = genreTable.drop('name',1).drop('weapons',1).drop('spells',1).drop('herotype',1)
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#df.loc[df.index.isin(recommendationTable_df.head(3).keys())] #adjust the value of 3 here
df.loc[recommendationTable_df.head(6).index, :]
Answer:
The dummies{1,2,3} identifiers are kind of OK, but not very mnemonic.
Actually, there's no need to invent new names at all; we can go the DRY, anonymous route:
dummies = [df[col].str.get_dummies(sep=',')
for col in ['weapons', 'spells', 'herotype']]
genre_data = pd.concat([df] + dummies, axis=1)
Now that we have the dummies, this would be an appropriate time to drop those three source columns.
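A minimal sketch of that, assuming the same column names as above (genre_data would then carry only name plus the indicator columns):
# build the indicator columns once, then retire the raw string columns
source_cols = ['weapons', 'spells', 'herotype']
dummies = [df[col].str.get_dummies(sep=',') for col in source_cols]
genre_data = pd.concat([df] + dummies, axis=1).drop(columns=source_cols)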
PEP-8
asks that you spell it user_input
.
We're accumulating a bunch of stale variables
in the global namespace, like user_input
and dummies
.
I recommend you define the occasional helper function.
That gives no-longer-needed temp variables a chance
to go out of scope when the function exits.
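For instance, the user-profile computation could become one small function. This is just a sketch with hypothetical names, and it assumes genre_data no longer carries the three source columns:
def build_user_profile(genre_data, user_input):
    """Weighted sum of the indicator columns for the rated characters."""
    ratings = pd.DataFrame(user_input)
    # merging keeps each rating lined up with its character's indicator row
    rated = genre_data.merge(ratings, on='name')
    indicators = rated.drop(columns=['name', 'rating'])
    return indicators.mul(rated['rating'], axis=0).sum()
Once it returns, ratings, rated and indicators all go out of scope with it.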
DRY.
We seem to be doing some repeated operations:
dropping three source columns, checking inputname
.
Maybe we need a separate username
dataframe.
But maybe we could have just added the occasional
column to genre_data
instead?
The # comments
are a bit more chatty than my own style,
but that's fine; I imagine you find them helpful.
I will note that naming a helper function goes a
long way toward explaining the
Single Responsibility
of a few lines of code. And if more explanation is needed,
there's lots of room for that in the helper's """docstring"""
.
Also, a small helper function is a function that is easily unit tested.
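As a hypothetical example, reusing the build_user_profile sketch from above with a tiny hand-built frame:
import pandas as pd
import pandas.testing as pdt

def test_build_user_profile_single_character():
    genre_data = pd.DataFrame({
        'name': ['Irv', 'Zed Ryley'],
        'Sorcerer': [1, 1],
        'sling': [0, 1],
        'light crossbow': [1, 0],
    })
    profile = build_user_profile(genre_data, [{'name': 'Irv', 'rating': 1}])
    expected = pd.Series({'Sorcerer': 1, 'light crossbow': 1, 'sling': 0})
    pdt.assert_series_equal(profile.sort_index(), expected.sort_index())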
zomg, we're .drop()
ing those three source columns yet again?!?
Couldn't we have banished them just once, near the beginning?
The identifier recommendationTable_df
is an abomination.
Please don't do that. Pick an approach to naming, perhaps
the camelCase / JS recommendationTableDf
, or since this
is Python, perhaps the PEP-8 recommendation_table_df
.
But mixing them won't help to improve the readability
of the code.
You have a commented expression containing .head(3)
- I'm sure
it was helpful during development, but it's time to remove it
from final production code, now.
The last line includes the
magic number
6
. Please give it a name.
Often this is most conveniently done in a function signature:
def get_final_recommendations(num_recs=6):
    return df.loc[recommendationTable_df.head(num_recs).index, :]
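Callers can then accept the default or override it explicitly (hypothetical calls, reusing the name above):
top_picks = get_final_recommendations()             # default of 6
more_picks = get_final_recommendations(num_recs=10)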
Overall, I can see there was a clear emphasis on trying to make the code readable and maintainable.
("Maintenance" means new features and bug fixes.)
I would be reluctant to hand this off, in its current state, for some new hire to perform a maintenance task. The maintainer would probably need to discuss things with the original author.
Frozen in / out data, unit tests, and small helpers would help to improve the maintainability situation.