I'm trying to encode the non-numeric columns of a pandas DataFrame as numeric values. I'm using:
df = df.fillna('0')
msk = np.random.rand(len(df)) < 0.8
df_train = df[msk]
df_test = df[~msk]
columns_to_encode = df.select_dtypes(exclude=[np.number]).columns
encoder_dict = {col: LabelEncoder() for col in columns_to_encode }
df_train_enc = df_train
df_test_enc = df_test
for col in columns_to_encode:
    encoder_dict[col].fit_transform(df_train_enc[col])
This, however, throws an error: TypeError: '<' not supported between instances of 'str' and 'float'.
What am I missing here? I thought LabelEncoder should be able to transform strings to numerics...
asked Apr 13, 2018 at 10:53
1 Answer
LabelEncoder works on string labels without an issue. If a column holds mixed types (because of missing values, for example), cast it to str before encoding:
for col in columns_to_encode:
    encoder_dict[col].fit_transform(df_train_enc[col].astype(str))
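For illustration, here is a minimal, self-contained sketch of the same idea. The toy DataFrame, its column names (code, score) and the fixed 4/2 split are assumptions made up for this example and do not come from the question. Unlike the loop above, it keeps the encoded values and reuses the encoder fitted on the training rows for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: "code" mixes strings and floats in one column,
# the kind of mix that triggers the TypeError from the question.
df = pd.DataFrame({
    "code": ["x", 2.5, "y", "x", 2.5, "y"],
    "score": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Fixed split so the result is reproducible (the question uses a random mask).
df_train = df.iloc[:4].copy()
df_test = df.iloc[4:].copy()

columns_to_encode = df.select_dtypes(exclude=[np.number]).columns
encoder_dict = {col: LabelEncoder() for col in columns_to_encode}

for col in columns_to_encode:
    # astype(str) makes every value a string, so the labels can be sorted.
    df_train[col] = encoder_dict[col].fit_transform(df_train[col].astype(str))
    # Reuse the encoder fitted on the training rows; do not refit on test data.
    df_test[col] = encoder_dict[col].transform(df_test[col].astype(str))

print(df_train)
print(df_test)
```

Note that transform raises a ValueError for labels the encoder did not see during fit, so with a random split the test rows may contain values that need separate handling.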
answered Apr 13, 2018 at 10:58
2 Comments
Ami Tavory: Did you try astype(str)?
lte__: Yes, that helped, will accept in 7 mins. Thank you!
nan values in your data, see: stackoverflow.com/q/43956705/4121573