37

The basic task that I have at hand is

a) Read some tab separated data.

b) Do some basic preprocessing

c) For each categorical column use LabelEncoder to create a mapping. This is done somewhat like this:

from sklearn import preprocessing

# Converting categorical data: one LabelEncoder per column
mapper = {}
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
    df[x] = mapper[x].fit_transform(df[x])

where df is a pandas dataframe and categorical_list is a list of column headers that need to be transformed.

d) Train a classifier and save it to disk using pickle

e) Now in a different program, the model saved is loaded.

f) The test data is loaded and the same preprocessing is performed.

g) The LabelEncoders are used for converting categorical data.

h) The model is used to predict.

Now the question that I have is, will the step g) work correctly?

As the documentation for LabelEncoder says

It can also be used to transform non-numerical labels (as long as 
they are hashable and comparable) to numerical labels.

So will each entry hash to the exact same value every time?

If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different way from LabelEncoder?

asked Feb 22, 2015 at 10:21
3
  • You could just try this, but yes the idea is that the hash will be the same for the same inputs Commented Feb 22, 2015 at 10:55
  • Why not pickle these mappers? Commented Feb 22, 2015 at 14:08
  • I tried...It just dumps {}...how do i get those key value pairs?? Commented Feb 22, 2015 at 14:09

8 Answers

71

According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if the LabelEncoders you fit at test time see data with exactly the same set of unique values as the training data.
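To illustrate why the set of unique values matters, here is a minimal sketch on made-up toy data (the values are hypothetical). It assumes, as the implementation suggests, that a code is simply the position of a value in the sorted classes_ array, so no hashing is involved and refitting on a different set of values shifts the codes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: the test set happens to lack the value "bus".
train_values = ["bus", "car", "train"]
test_values = ["car", "train"]

train_encoder = LabelEncoder().fit(train_values)
refit_encoder = LabelEncoder().fit(test_values)  # refitting at test time

# Each code is the index of the value in the sorted classes_ array.
train_codes = train_encoder.transform(["car", "train"]).tolist()
refit_codes = refit_encoder.transform(["car", "train"]).tolist()

print(train_codes)  # codes the model was trained on
print(refit_codes)  # different codes for the same strings
```

The same strings get different integers, which is why the encoder fitted on the training data has to be reused rather than refitted.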

There's a somewhat hacky way to reuse the LabelEncoders you got during training. LabelEncoder has only one fitted attribute, namely classes_. You can save it and then restore it like this:

Train:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)

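Test:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ has object dtype (e.g. strings from pandas)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`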

This seems more efficient than refitting it using the same data.

answered Feb 22, 2015 at 14:20

6 Comments

That was the first solution I thought about too. The thing is, what if I have different values for a column that I encoded before? Those unique values will not be in LabelEncoder (and also in my models). What may be the solution here?
@nope: I don't see any solutions other than to just ignore this feature, and hope the model's performance would not go down significantly.
You can create a function with a recreate option. If the dataset changes, you recreate the classes.npy file.
@nope: you can introduce an extra class to represent unseen values in the mapping during training, and yes, that class will not be used anywhere during training. But once you start testing, you will most likely get some unseen values. Your encoder will be able to handle them and simply map them to the class created earlier, namely "unseen".
I managed to create the file, however, during load it comes as an empty array. Any solutions to that?
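The "unseen" class suggested in the comments above could be sketched roughly as follows. The helper names and the placeholder label are hypothetical and not part of scikit-learn; the sketch assumes the placeholder string never occurs as a real value in the data.

```python
from sklearn.preprocessing import LabelEncoder

UNSEEN = "unseen"  # hypothetical placeholder label, not a scikit-learn feature


def fit_with_unseen(values):
    """Fit a LabelEncoder on the training values plus the placeholder class."""
    return LabelEncoder().fit(list(values) + [UNSEEN])


def transform_with_unseen(encoder, values):
    """Replace values the encoder never saw with the placeholder, then encode."""
    known = set(encoder.classes_)
    cleaned = [v if v in known else UNSEEN for v in values]
    return encoder.transform(cleaned)


encoder = fit_with_unseen(["bus", "car", "train"])
# "plane" was not present at fit time, so it falls back to the placeholder class.
codes = transform_with_unseen(encoder, ["car", "plane"]).tolist()
print(codes)
```

Since no training row carries the placeholder code, the model learns nothing specific about it; this only keeps transform() from raising on new values.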
23

For me the easiest way was exporting the LabelEncoder as a .pkl file for each column. You have to export the encoder for each column after calling fit_transform().

For example:

from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder() 
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# export the Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)

Then in the testing project, you can load the LabelEncoder object and apply the transform() function directly:

from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_test = pd.read_csv('testing_data.csv')
# load the encoder file
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)
df_test['Departure'] = le_departure.transform(df_test['Departure'])
answered Apr 29, 2019 at 0:13

3 Comments

AttributeError: 'LabelEncoder' object has no attribute 'classes_'
@ArunGeorge I believe that my solution doesn't mention classes_ anywhere. Please try it again and tell me if I can help.
Given that you might have multiple columns you would like to transform...can you also put all the variables in an sklearn pipeline and then just save 1 object?
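The pipeline idea raised in the last comment could look roughly like the sketch below. It is not the answerer's code: the toy data, the column names and the choice of DecisionTreeClassifier are made up for illustration, and it swaps LabelEncoder for OrdinalEncoder, because LabelEncoder is meant for target labels and does not plug into a ColumnTransformer.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the column names are made up for illustration.
df_train = pd.DataFrame({
    "Departure": ["NYC", "LAX", "NYC", "SFO"],
    "Arrival": ["LAX", "NYC", "SFO", "NYC"],
    "Delay": [5, 30, 0, 45],
})
y_train = [0, 1, 0, 1]
categorical = ["Departure", "Arrival"]

# Encoding and classifier live in one object, so one file holds everything.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OrdinalEncoder(), categorical)],
        remainder="passthrough",  # keep the numeric column unchanged
    )),
    ("clf", DecisionTreeClassifier(random_state=0)),
])
model.fit(df_train, y_train)

# Round trip through pickle (in memory here; a file works the same way).
restored = pickle.loads(pickle.dumps(model))
same = bool((restored.predict(df_train) == model.predict(df_train)).all())
print(same)
```

The restored pipeline accepts the raw, unencoded dataframe, so the test program does not need to know about the individual encoders at all.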
5
from sklearn.preprocessing import LabelEncoder
import joblib
import pandas as pd
df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder() 
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# save the fitted encoder to disk
joblib.dump(le, 'labelEncoder.joblib', compress=9)
# load it again at test time
le = joblib.load('labelEncoder.joblib')
answered Mar 14, 2022 at 5:53

1 Comment

A code-only answer is not high quality. While this code may be useful, you can improve it by saying why it works, how it works, when it should be used, and what its limitations are. Please edit your answer to include explanation and link to relevant documentation.
3

What works for me is LabelEncoder().fit(X_train[col]), pickling these objects for each categorical column col and then reusing the same objects for transforming the same categorical column col in the validation dataset. Basically you have a label encoder object for each of your categorical columns.

  1. So fit() on training data and pickle the objects/models corresponding to each column in the training dataframe X_train.
  2. For each col in columns of validation set X_cv, load the corresponding object/model and apply the transformation by accessing the transform function as: transform(X_cv[col]).
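The two steps above can be sketched as follows, keeping all encoders in one dictionary keyed by column name. The toy dataframes are hypothetical, and the pickle round trip is done in memory only to keep the example self-contained; writing to a file with pickle.dump and reading it back with pickle.load works the same way.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for X_train and X_cv.
X_train = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
X_cv = pd.DataFrame({"color": ["blue", "red"], "size": ["L", "S"]})

# Step 1: fit one encoder per categorical column, then serialize them together.
encoders = {col: LabelEncoder().fit(X_train[col]) for col in X_train.columns}
blob = pickle.dumps(encoders)

# Step 2 (in the other program): load the encoders and transform each column.
loaded = pickle.loads(blob)
for col, enc in loaded.items():
    X_cv[col] = enc.transform(X_cv[col])

print(X_cv)
```

Note that the dictionary has to be pickled after the encoders are fitted and stored in it; pickling an empty dictionary just yields {}.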
answered Sep 23, 2016 at 11:05

1

First we must assume that in your other program there are no new labels (unseen in the first).

As Osama Ayman mentioned above, and as stated in scikit-learn's documentation on model persistence, you can achieve what you want by serializing the label encoder to a local file with joblib.dump after you obtain it in the first program, and deserializing it with joblib.load in the second.

Note that a pickled LabelEncoder object may not work as intended once your scikit-learn version changes (e.g., after an upgrade). joblib is built on top of pickle, so the same caveat applies to joblib.dump. To make sure that your saved label encoder keeps working the way it did when you created it, load it with the same scikit-learn version that was used to save it.
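Since a serialized encoder is tied to the scikit-learn version that produced it, one option is to store the version next to the encoder and check it on load. This is only a sketch under that assumption; the file name and the dictionary keys are hypothetical.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(["bus", "car", "train"])

# Hypothetical file name; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "label_encoder.joblib")

# First program: save the encoder together with the library version.
joblib.dump({"encoder": encoder, "sklearn_version": sklearn.__version__}, path)

# Second program: refuse to use the encoder if the versions differ.
payload = joblib.load(path)
if payload["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("encoder was saved with a different scikit-learn version")
restored = payload["encoder"]

codes = restored.transform(["train", "bus"]).tolist()
print(codes)
```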

answered Sep 20, 2023 at 12:28

0

As I found no other post about nominal/categorical encoding, I will expand on the above-mentioned solutions and share mine for the OrdinalEncoder approach (which may have been what the author intended anyway).

I did the following with OrdinalEncoder (but it should work with LabelEncoder as well). Note that I am using categories_ instead of classes_.

  1. Create an Encoder dictionary
  2. Save it with numpy
  3. Load it with numpy
  4. Iterate over the dict and apply the transformation on each column

Note: np stands for numpy.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# ------- steps 1 and 2: in the file/cell where the encoding shall be exported
encoder_dict = dict()
for nom in nominal_columns:
    enc = OrdinalEncoder()
    df[[nom]] = enc.fit_transform(df[[nom]])
    encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]
np.save('FILE_NAME.npy', encoder_dict)

# ------- steps 3 and 4: in the file where the encoding shall be imported
encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).tolist()
for nom in encoder_dict:
    if nom in df.columns:
        enc = OrdinalEncoder()
        # Setting categories_ by hand relies on encoder internals and may not be
        # enough in newer scikit-learn versions, which expect more fitted attributes.
        enc.categories_ = encoder_dict[nom]
        df[[nom]] = enc.transform(df[[nom]])
answered Oct 21, 2020 at 16:40

1 Comment

I did this for OneHotEncoder but I have error: AttributeError: 'OneHotEncoder' object has no attribute 'drop_idx_'
-1

If you're already saving your model via pickle, I would do the same for the pre-processing tools.

One way to do it would be combining everything into a class:

import pickle
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

class MyClassifier():
    def load_data(self):
        ...
    def fit(self):
        self.first_column_encoder = preprocessing.LabelEncoder()
        self.first_column_encoder.fit(...)
        ...
        self.second_column_encoder = preprocessing.LabelEncoder()
        self.second_column_encoder.fit(...)
        ...
        self.model = KNeighborsClassifier(...)
        self.model.fit(...)

my_classifier = MyClassifier()
my_classifier.fit()
pickle.dump(my_classifier, file)

Note: You may want to use OrdinalEncoder instead of LabelEncoder for input categories
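As a rough illustration of the OrdinalEncoder suggestion, the sketch below encodes an input column and maps categories unseen at fit time to -1. The toy data is hypothetical, and the handle_unknown="use_encoded_value" option assumes scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; "BOS" does not occur in the training set.
train = pd.DataFrame({"Departure": ["LAX", "NYC", "SFO"]})
test = pd.DataFrame({"Departure": ["NYC", "BOS"]})

# Unknown categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[["Departure"]])

codes = encoder.transform(test[["Departure"]]).ravel().tolist()
print(codes)
```

Unlike LabelEncoder, OrdinalEncoder works on 2D input, so a single instance can encode several feature columns at once and be pickled as one object.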

answered Mar 11, 2021 at 13:17

-2

You can do this after you have encoded the values with the "le" object:

encoding = {}
for i in list(le.classes_):
    encoding[i] = le.transform([i])[0]

You will get the "encoding" dictionary with the mapping for later use. With pandas, for example, you can export this dictionary to a CSV file.
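To carry such a dictionary over to a different program, it has to be written out and read back. The sketch below uses JSON rather than CSV, purely as an assumption for brevity, and keeps the serialized text in a variable instead of a file; the toy labels are hypothetical.

```python
import json

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["bus", "car", "train"])

# Build the label -> code mapping; cast to plain str/int because
# NumPy scalar types are not JSON serializable.
encoding = {
    str(label): int(code)
    for label, code in zip(le.classes_, le.transform(le.classes_))
}
text = json.dumps(encoding)  # in practice, write this string to a file

# In the other program: read the mapping back and apply it by lookup.
loaded = json.loads(text)
codes = [loaded[value] for value in ["train", "bus"]]
print(codes)
```

A plain dictionary lookup raises KeyError for labels that were not present at fit time, so unseen values still need separate handling.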

answered May 20, 2020 at 15:59

1 Comment

This doesn't work because OP's step e) explicitly says "in a different program".
