Using Scikit's LabelEncoder correctly across multiple programs

Question 1

The basic task that I have at hand is

a) Read some tab separated data.

b) Do some basic preprocessing

c) For each categorical column use LabelEncoder to create a mapping. This is done somewhat like this

mapper = {}
# Converting categorical data
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])

where df is a pandas dataframe and categorical_list is a list of column headers that need to be transformed.

d) Train a classifier and save it to disk using pickle

e) Now in a different program, the model saved is loaded.

f) The test data is loaded and the same preprocessing is performed.

g) The LabelEncoders are used for converting categorical data.

h) The model is used to predict.

Now the question that I have is, will the step g) work correctly?

As the documentation for LabelEncoder says

It can also be used to transform non-numerical labels (as long as 
they are hashable and comparable) to numerical labels.

So will each entry hash to the exact same value every time?

If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different approach from LabelEncoder?

Question 2

You could just try this, but yes, the idea is that the hash will be the same for the same inputs.

Question 3

Why not pickle these mappers?

Question 4

I tried. It just dumps {}. How do I get those key-value pairs?

Question 5

According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit LabelEncoders at the test time with data that have exactly the same set of unique values.
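A minimal sketch of why this holds, assuming a recent scikit-learn: LabelEncoder does not rely on hash values for the codes. It sorts the unique values seen in fit and uses each value's position in classes_ as its code. The city names below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and test columns (illustrative values only).
train_values = ["paris", "tokyo", "amsterdam", "tokyo"]
test_values = ["tokyo", "paris"]  # "amsterdam" is absent here

train_encoder = LabelEncoder().fit(train_values)
refit_encoder = LabelEncoder().fit(test_values)

# classes_ holds the sorted unique values; the index is the code.
train_classes = list(train_encoder.classes_)
refit_classes = list(refit_encoder.classes_)

# The same string gets a different code once the set of values differs.
tokyo_train_code = int(train_encoder.transform(["tokyo"])[0])
tokyo_refit_code = int(refit_encoder.transform(["tokyo"])[0])
```

So refitting at test time only reproduces the training codes when the test column contains exactly the same unique values.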

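There's a somewhat hacky way to reuse the LabelEncoders you fitted during training. A fitted LabelEncoder has only one learned attribute, namely classes_. You can save it and then restore it like this:

Train:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)

Test:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# If classes_ was saved as an object array, pass allow_pickle=True here.
encoder.classes_ = numpy.load('classes.npy')
# Now you should be able to use encoder
# as you would after `fit`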

This seems more efficient than refitting it using the same data.
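As a rough end-to-end check of this approach, here is a sketch that saves classes_ to a temporary file and restores it into a fresh encoder. The color values and the temporary path are assumptions for illustration; in practice the two halves would live in separate programs.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training column (illustrative values only).
train_values = np.array(["red", "green", "blue", "green"])

# "Training program": fit and persist the learned classes.
fitted = LabelEncoder().fit(train_values)
classes_path = os.path.join(tempfile.mkdtemp(), "classes.npy")
np.save(classes_path, fitted.classes_)

# "Test program": rebuild the encoder from the saved classes.
restored = LabelEncoder()
# allow_pickle is only needed if classes_ was saved as an object array,
# e.g. when the values came from a pandas object column.
restored.classes_ = np.load(classes_path, allow_pickle=True)

original_codes = fitted.transform(["blue", "red"]).tolist()
restored_codes = restored.transform(["blue", "red"]).tolist()
```

Only pass allow_pickle=True for files you created yourself, since loading pickled data from an untrusted source can run arbitrary code.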

Question 6

That was the first solution I thought about too. The thing is, what if I have different values for a column that I encoded before? Those unique values will not be in LabelEncoder (and also in my models). What may be the solution here?

Question 7

@nope: I don't see any solutions other than to just ignore this feature, and hope the model's performance would not go down significantly.

Question 8

You can create a function with a recreate option. If the dataset changes, you recreate the classes.npy file.

Question 9

@nope: you can introduce an extra class to represent unseen values in the mapping during training. That class will not be used anywhere during training, but once you start testing, you will most likely get some unseen values. Your encoder will then be able to handle them by simply mapping them to the class created earlier, namely "unseen".
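One way this idea might be sketched, assuming the placeholder string never occurs as a real category. The helper names fit_with_unseen and transform_with_unseen are hypothetical, not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

UNSEEN = "unseen"  # hypothetical placeholder; must not be a real category


def fit_with_unseen(values):
    """Fit a LabelEncoder on the training values plus the placeholder."""
    encoder = LabelEncoder()
    encoder.fit(np.append(np.asarray(values, dtype=object), UNSEEN))
    return encoder


def transform_with_unseen(encoder, values):
    """Replace values the encoder never saw with the placeholder, then encode."""
    values = np.asarray(values, dtype=object)
    known = np.isin(values, encoder.classes_)
    return encoder.transform(np.where(known, values, UNSEEN))


encoder = fit_with_unseen(["cat", "dog", "cat"])
# "horse" was not in the training data, so it maps to the placeholder's code.
codes = transform_with_unseen(encoder, ["dog", "horse", "cat"]).tolist()
```

Since no training row carries the placeholder code, the model learns nothing about it; whether predictions on such rows are acceptable depends on the classifier.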

Question 10

I managed to create the file, however, during load it comes as an empty array. Any solutions to that?

Question 11

For me, the easiest way was exporting the LabelEncoder as a .pkl file for each column. You have to export the encoder for each column after calling fit_transform().

For example

from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd

df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# export the fitted Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)

Then, in the testing project, you can load the LabelEncoder object and apply transform() directly.

import pickle
import pandas as pd

df_test = pd.read_csv('testing_data.csv')
# load the encoder file
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)
df_test['Departure'] = le_departure.transform(df_test['Departure'])

Question 12

AttributeError: 'LabelEncoder' object has no attribute 'classes_'

Question 13

@ArunGeorge I believe my solution doesn't mention classes_ anywhere. Please try it again and tell me if I can help.

Question 14

Given that you might have multiple columns you would like to transform...can you also put all the variables in an sklearn pipeline and then just save 1 object?
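A possible sketch of that single-object approach, assuming scikit-learn 0.24 or later (for handle_unknown="use_encoded_value") and using OrdinalEncoder, which handles several input columns at once. The column names, values, and the choice of KNeighborsClassifier are made up for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data (column names and values are made up).
df_train = pd.DataFrame({
    "Departure": ["AMS", "CDG", "AMS", "LHR"],
    "Arrival": ["JFK", "JFK", "NRT", "NRT"],
    "delayed": [0, 1, 0, 1],
})
categorical_columns = ["Departure", "Arrival"]

pipeline = Pipeline([
    ("encode", ColumnTransformer([
        # Unknown categories at predict time are encoded as -1.
        ("categories",
         OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         categorical_columns),
    ])),
    ("model", KNeighborsClassifier(n_neighbors=1)),
])
pipeline.fit(df_train[categorical_columns], df_train["delayed"])

# Save the encoders and the classifier together as one object.
pipeline_path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, pipeline_path)

# In the other program: load and predict on raw (unencoded) columns.
loaded = joblib.load(pipeline_path)
df_test = pd.DataFrame({"Departure": ["AMS", "SFO"], "Arrival": ["JFK", "NRT"]})
predictions = loaded.predict(df_test)
```

The test program then never touches the encoders directly, so the training and test encodings cannot drift apart.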

Question 15

from sklearn.preprocessing import LabelEncoder
import joblib
import pandas as pd

df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# save the fitted encoder
joblib.dump(le, 'labelEncoder.joblib', compress=9)
# load it again at test time
le = joblib.load('labelEncoder.joblib')

Question 16

A code-only answer is not high quality. While this code may be useful, you can improve it by saying why it works, how it works, when it should be used, and what its limitations are. Please edit your answer to include explanation and link to relevant documentation.

Question 17

What works for me is LabelEncoder().fit(X_train[col]), pickling these objects for each categorical column col and then reusing the same objects for transforming the same categorical column col in the validation dataset. Basically you have a label encoder object for each of your categorical columns.

So fit() on training data and pickle the objects/models corresponding to each column in the training dataframe X_train.
For each col in columns of validation set X_cv, load the corresponding object/model and apply the transformation by accessing the transform function as: transform(X_cv[col]).
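The two steps above could look roughly like this when all per-column encoders are kept in one dictionary and pickled together. The frames, column names, and temporary path are assumptions for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and validation frames (values are made up).
X_train = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
X_cv = pd.DataFrame({"color": ["blue", "red"], "size": ["L", "S"]})

# Fit one encoder per categorical column on the training data only.
encoders = {col: LabelEncoder().fit(X_train[col]) for col in X_train.columns}

# Pickle the dictionary after fitting, so the fitted state is saved.
encoders_path = os.path.join(tempfile.mkdtemp(), "encoders.pkl")
with open(encoders_path, "wb") as f:
    pickle.dump(encoders, f)

# Later, or in another program: load and reuse transform() per column.
with open(encoders_path, "rb") as f:
    loaded_encoders = pickle.load(f)
for col, encoder in loaded_encoders.items():
    X_cv[col] = encoder.transform(X_cv[col])
```

Note that transform() raises a ValueError if the validation column contains a value that was not present during fit.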

Question 18

First we must assume that in your other program there are no new labels (unseen in the first).

As Osama Ayman mentioned above, and as stated in scikit-learn's documentation on model persistence, you can achieve what you want by serializing the label encoder to a local file with joblib.dump after you obtain it in the first program, and deserializing it with joblib.load.

Note that a pickled LabelEncoder may not work as intended once your scikit-learn version changes (e.g., after an upgrade). joblib is built on pickle and shares this limitation, so to make sure the saved label encoder behaves the same as when you created it, load it with the same scikit-learn version that saved it.
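A small sketch of guarding against a version mismatch, assuming you are free to choose the file layout: store the scikit-learn version next to the encoder and compare it on load. The bundle keys and the load_encoder helper are hypothetical.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(["no", "yes", "no"])  # made-up labels

# Store the library version next to the encoder.
bundle = {"sklearn_version": sklearn.__version__, "encoder": encoder}
bundle_path = os.path.join(tempfile.mkdtemp(), "encoder_bundle.joblib")
joblib.dump(bundle, bundle_path)


def load_encoder(path):
    """Load the bundle and refuse to continue on a version mismatch."""
    saved = joblib.load(path)
    if saved["sklearn_version"] != sklearn.__version__:
        raise RuntimeError(
            f"Encoder was saved with scikit-learn {saved['sklearn_version']}, "
            f"but {sklearn.__version__} is installed."
        )
    return saved["encoder"]


restored = load_encoder(bundle_path)
codes = restored.transform(["yes", "no"]).tolist()
```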

Question 19

As I found no other post about nominal/categorical encoding, I expand on the solutions above and share my approach for OrdinalEncoder (which may be what the author intended anyway).

I did the following with OrdinalEncoder (it should work with LabelEncoder as well). Note that I am using categories_ instead of classes_.

1. Create an encoder dictionary
2. Save it with numpy
3. Load it with numpy
4. Iterate over the dict and apply the transformation on each column

Note: np stands for numpy.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# ------- steps 1 and 2: in the file/cell where the encoding is exported
encoder_dict = dict()
for nom in nominal_columns:
    enc = OrdinalEncoder()  # a fresh encoder per nominal column
    enc = enc.fit(df[[nom]])
    df[[nom]] = enc.transform(df[[nom]])
    encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]
np.save('FILE_NAME.npy', encoder_dict)

# ------- steps 3 and 4: in the file where the encoding is imported
enc = OrdinalEncoder()
encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).tolist()
for nom in encoder_dict:
    for col in df.columns:
        if nom == col:
            # Setting categories_ by hand relies on scikit-learn internals and
            # may fail on versions that expect other fitted attributes.
            enc.categories_ = encoder_dict[nom]
            df[[col]] = enc.transform(df[[col]])

Question 20

I did this for OneHotEncoder but I get an error: AttributeError: 'OneHotEncoder' object has no attribute 'drop_idx_'

Question 21

If you're already saving your model via pickle, I would do the same for the pre-processing tools.

One way to do it would be combining everything into a class:

class MyClassifier():
    def load_data(self):
        ...
    def fit(self):
        self.first_column_encoder = preprocessing.LabelEncoder()
        self.first_column_encoder.fit(...)
        ...
        self.second_column_encoder = preprocessing.LabelEncoder()
        self.second_column_encoder.fit(...)
        ...
        self.model = KNeighborsClassifier(...)
        self.model.fit(...)

my_classifier = MyClassifier()
my_classifier.fit()
pickle.dump(my_classifier, file)

Note: You may want to use OrdinalEncoder instead of LabelEncoder for input categories, since LabelEncoder is intended for target labels.

Question 22

You can do this after you have encoded the values with the "le" object:

encoding = {}
for i in list(le.classes_):
    encoding[i] = le.transform([i])[0]

You will get the "encoding" dictionary with the encoding for later use. With pandas, you can export this dictionary to a CSV file, for example.
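For the dictionary to be usable from a different program it has to be written to disk. A minimal sketch using JSON instead of CSV, assuming the categories are strings (JSON keys are always strings); the categories and the temporary path are made up.

```python
import json
import os
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["b", "a", "c", "a"])  # made-up categories

# Build a plain dict; int() converts numpy integers so json can write them.
encoding = {str(label): int(code)
            for label, code in zip(le.classes_, le.transform(le.classes_))}

mapping_path = os.path.join(tempfile.mkdtemp(), "encoding.json")
with open(mapping_path, "w") as f:
    json.dump(encoding, f)

# In another program: load the mapping and apply it with pandas.
with open(mapping_path) as f:
    loaded_encoding = json.load(f)
column = pd.Series(["c", "a", "z"])
# Values missing from the mapping ("z" here) become NaN.
encoded = column.map(loaded_encoding)
```

Because unknown values come back as NaN rather than raising an error, you can decide explicitly how to fill or drop them before predicting.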

Question 23

This doesn't work because OP's step e) explicitly says "in a different program".

Accepted answer (the post shown above under Question 5): Artem Sobolev, score 71, answered Feb 22, 2015 at 14:20, edited Mar 30, 2023 at 18:24.

Comments on the accepted answer, in order (their text appears above under Question 6 through Question 10):

nope, 2017-05-17
Artem Sobolev, 2017-05-22
ricoms, 2018-06-07
Uylenburgh, 2018-11-26
Daniel Vilas-Boas, 2019-12-01
