\$\begingroup\$

Question: was using get_dummies a good choice for converting categorical strings?

I used get_dummies to convert categorical variables into dummy/indicator variables for a cold-start recommender system. It uses only categorical information, drawn from a small, fixed set of choices.

The code works and the output seems good. This is my first data science project, which I am doing for fun. I put it together from reading documentation and searching Stack Overflow. I plan on making it a hybrid recommender system soon by adding sentiment analysis and topic classification, both of which I also recently finished.

To check another character, just enter a different character name in userInput. The full notebook and Excel sheet for import are on my GitHub.

I would be grateful if someone could comment on whether or not I was able to achieve the following goals (and of course if not, what I can improve):

  • Code structure,
  • Style and readability: Is the code comprehensible?
  • Are there any bad practices?

These are the attributes in my code:

  • Character's name (must be unique)

  • herotype (must be one of the following choices)

    • Bard
    • Sorcerer
    • Paladin
    • Rogue
    • Druid
    • Sorcerer
  • weapons (can have one or more of the following choices)

    • Dagger
    • sling
    • club
    • light crossbow
    • battleaxe
    • Greataxe
  • spells (can have one or more of the following choices)

    • Transmutation
    • Enchantment
    • Necromancy
    • Abjuration
    • Conjuration
    • Evocation

Input and Output

Get Another Recommendation
To get a recommendation for a different character, change the name 'Irv' to another one, such as 'Zed Ryley'.

userInput = [
 {'name':'Irv', 'rating':1}
 ]

The results come back formatted like this:

    name              herotype  weapons                    spells
28  Irv               Sorcerer  light crossbow             Conjuration
9   yac               Sorcerer  Greataxe                   Conjuration, Evocation, Transmutation
18  Traubon Durthane  Sorcerer  light crossbow             Evocation, Transmutation, Necromancy
8   wuc               Sorcerer  light crossbow, battleaxe  Necromancy
1   niem              Sorcerer  light crossbow, battleaxe  Necromancy
23  Zed Ryley         Sorcerer  sling                      Evocation

For comparison

Here are the scores which show how it ranks the results.


In [5]:
recommendationTable_df.head(6)
Out[5]:
28 1.000000
9 0.666667
18 0.666667
8 0.666667
1 0.666667
23 0.333333
dtype: float64
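
As a rough check on where these numbers come from, the sketch below recomputes three of the scores from rows copied out of the results table. It assumes multi-valued cells are delimited by a comma followed by a space, as the printed table suggests, and so passes sep=", " rather than the sep=',' used in the code further down; if the cells really do contain that space, sep=',' would produce column names with a leading space. The names features, profile and scores are illustrative only.

```python
import pandas as pd

# Three rows copied from the results table; the rest of the data set is omitted.
df = pd.DataFrame({
    "name": ["Irv", "yac", "Zed Ryley"],
    "herotype": ["Sorcerer", "Sorcerer", "Sorcerer"],
    "weapons": ["light crossbow", "Greataxe", "sling"],
    "spells": ["Conjuration", "Conjuration, Evocation, Transmutation", "Evocation"],
})

# One 0/1 indicator column per distinct value.
# ASSUMPTION: cells are delimited by ", " (comma plus space).
features = pd.concat(
    [df[col].str.get_dummies(sep=", ") for col in ["herotype", "weapons", "spells"]],
    axis=1,
)

# With a single input character rated 1, the profile is just that character's row.
profile = features.loc[df["name"] == "Irv"].iloc[0]

# Score = number of shared features / number of features in the profile.
scores = (features * profile).sum(axis=1) / profile.sum()
scores.index = df["name"]
print(scores)
```

Irv has three features (Sorcerer, light crossbow, Conjuration). yac shares two of them and Zed Ryley shares one, which is consistent with the 0.666667 and 0.333333 shown above.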

Code

#imports 
import pandas as pd
import numpy as np
df = pd.read_excel('dnd-dataframe.xlsx', sheet_name=0, usecols=['name', 'weapons','herotype','spells'])
df.head(30)

dummies1 = df['weapons'].str.get_dummies(sep=',')
dummies2 = df['spells'].str.get_dummies(sep=',')
dummies3 = df['herotype'].str.get_dummies(sep=',')
genre_data = pd.concat([df, dummies1,dummies2, dummies3], axis=1)
userInput = [
 {'name':'Irv', 'rating':1} #There is no rating system being used so by default rating is set to 1
 ] 
inputname = pd.DataFrame(userInput)
inputId = df[df['name'].isin(inputname['name'].tolist())]
#Then merging it so we can get the name. It's implicitly merging it by name.
inputname = pd.merge(inputId, inputname)
#Dropping information we won't use from the input dataframe
inputname = inputname.drop(columns=['weapons', 'spells', 'herotype'])
#Filtering out the names from the input
username = genre_data[genre_data['name'].isin(inputname['name'].tolist())]
#Resetting the index to avoid future issues
username = username.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userGenreTable = username.drop(columns=['name', 'weapons', 'spells', 'herotype'])
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputname['rating'])
genreTable = genre_data.copy()
genreTable = genreTable.drop(columns=['name', 'weapons', 'spells', 'herotype'])
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#df.loc[df.index.isin(recommendationTable_df.head(3).keys())] #adjust the value of 3 here
df.loc[recommendationTable_df.head(6).index, :]
asked Jun 27, 2020 at 2:41
\$\endgroup\$

1 Answer

\$\begingroup\$

The dummies{1,2,3} identifiers are kind of OK, but not very mnemonic.

Actually, there's no need to invent new names at all; we can go the DRY anonymous route.

dummies = [df[col].str.get_dummies(sep=',')
 for col in ['weapons', 'spells', 'herotype']]
genre_data = pd.concat([df] + dummies, axis=1)

Now that we have the dummies, this would be an appropriate time to drop those three source columns.
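
One way that could look is sketched below, on a toy frame standing in for the spreadsheet. The SOURCE_COLS name and the choice to index by name are illustrative assumptions, not part of the original code.

```python
import pandas as pd

# Toy stand-in for the spreadsheet (hypothetical rows, comma-delimited cells).
df = pd.DataFrame({
    "name": ["Irv", "wuc"],
    "herotype": ["Sorcerer", "Sorcerer"],
    "weapons": ["light crossbow", "light crossbow,battleaxe"],
    "spells": ["Conjuration", "Necromancy"],
})

SOURCE_COLS = ["weapons", "spells", "herotype"]

dummies = [df[col].str.get_dummies(sep=",") for col in SOURCE_COLS]

# Drop the three source columns exactly once, so that only the name and
# the indicator columns remain for every later step.
genre_data = pd.concat([df.drop(columns=SOURCE_COLS)] + dummies, axis=1)

# Names are stated to be unique, so they can serve as the index.
genre_data = genre_data.set_index("name")
print(genre_data)
```

With the names in the index, every remaining column is numeric, so later steps need no further .drop() calls.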


PEP-8 asks that you spell it user_input.

We're accumulating a bunch of stale variables in the global namespace, like user_input and dummies. I recommend you define the occasional helper function. That gives no-longer-needed temp variables a chance to go out of scope when the function exits.
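
A minimal sketch of that idea follows. The function names build_features, score_against and recommend are hypothetical, and the toy frame uses comma-delimited cells to match sep=','. Temporaries such as dummies and profile live only inside the functions.

```python
import pandas as pd


def build_features(df, multi_cols=("weapons", "spells", "herotype"), sep=","):
    """Return a 0/1 indicator table indexed by character name."""
    dummies = [df[col].str.get_dummies(sep=sep) for col in multi_cols]
    return pd.concat([df[["name"]]] + dummies, axis=1).set_index("name")


def score_against(features, name):
    """Return, per character, the share of `name`'s features it has."""
    profile = features.loc[name]
    return (features * profile).sum(axis=1) / profile.sum()


def recommend(df, name, num_recs=6):
    """Return the `num_recs` rows of `df` most similar to `name`."""
    scores = score_against(build_features(df), name)
    best = scores.sort_values(ascending=False).head(num_recs)
    return df.set_index("name").loc[best.index]


# Usage on a toy frame (hypothetical stand-in for the spreadsheet).
toy = pd.DataFrame({
    "name": ["Irv", "yac", "Zed Ryley"],
    "herotype": ["Sorcerer", "Sorcerer", "Sorcerer"],
    "weapons": ["light crossbow", "Greataxe", "sling"],
    "spells": ["Conjuration", "Conjuration,Evocation", "Evocation"],
})
print(recommend(toy, "Irv", num_recs=2))
```

After recommend() returns, only its result remains in the global namespace.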


DRY. We seem to be doing some repeated operations: dropping three source columns, checking inputname.

Maybe we need a separate username dataframe. But maybe we could have just added the occasional column to genre_data instead?


The # comments are a bit more chatty than my own style, but that's fine; I imagine you find them helpful. I will note that naming a helper function goes a long way toward explaining the Single Responsibility of a few lines of code. And if more explanation is needed, there's lots of room for that in the helper's """docstring""".

Also, a small helper function is a function that is easily unit tested.
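
For instance, a scoring helper with a docstring and a test on frozen input and expected output might look like the sketch below. The name similarity_scores and the three-row table are made up for illustration.

```python
import pandas as pd
import pandas.testing as pdt


def similarity_scores(features, name):
    """Score every character against the one called `name`.

    `features` is a 0/1 indicator table indexed by character name.
    A score is the share of `name`'s features that a row also has,
    so `name` itself always scores 1.0.
    """
    profile = features.loc[name]
    return (features * profile).sum(axis=1) / profile.sum()


def test_similarity_scores():
    # Frozen input: a hand-built table, small enough to check by eye.
    features = pd.DataFrame(
        {"Sorcerer": [1, 1, 0], "sling": [0, 1, 0], "Evocation": [1, 0, 1]},
        index=["a", "b", "c"],
    )
    # Frozen expected output: "a" has two features, "b" and "c" each share one.
    expected = pd.Series([1.0, 0.5, 0.5], index=["a", "b", "c"])
    pdt.assert_series_equal(similarity_scores(features, "a"), expected)


test_similarity_scores()
```

A test runner such as pytest would collect test_similarity_scores automatically; calling it directly, as here, works too.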


zomg, we're .drop()ing those three source columns yet again?!? Couldn't we have banished them just once, near the beginning?


The identifier recommendationTable_df is an abomination. Please don't do that. Pick an approach to naming, perhaps the camelCase / JS recommendationTableDf, or since this is Python, perhaps the PEP-8 recommendation_table_df. But mixing them won't help to improve the readability of the code.


You have a commented expression containing .head(3) - I'm sure it was helpful during development, but it's time to remove it from final production code, now.

The last line includes the magic number 6. Please give it a name. Often this is most conveniently done in a function signature:

def get_final_recommendations(num_recs=6):
    return df.loc[recommendationTable_df.head(num_recs).index, :]

Overall, I can see there was a clear emphasis on trying to make the code readable and maintainable.

("Maintenance" means new features and bug fixes.)

I would be reluctant to hand this off, in its current state, for some new hire to perform a maintenance task. The maintainer would probably need to discuss things with the original author.

Frozen in / out data, unit tests, and small helpers would help to improve the maintainability situation.

Toby Speight
answered Jan 4, 2023 at 18:52
\$\endgroup\$
