The basic task that I have at hand is
a) Read some tab separated data.
b) Do some basic preprocessing.
c) For each categorical column, use LabelEncoder to create a mapping. This is done somewhat like this:
from sklearn import preprocessing
mapper = {}
# Convert categorical data: one LabelEncoder per column
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])
where df is a pandas DataFrame and categorical_list is a list of column headers that need to be transformed.
d) Train a classifier and save it to disk using pickle.
e) Now, in a different program, the saved model is loaded.
f) The test data is loaded and the same preprocessing is performed.
g) The LabelEncoders are used for converting categorical data.
h) The model is used to predict.
Now the question that I have is: will step g) work correctly?
As the documentation for LabelEncoder says:
"It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels."
So will each entry hash to the exact same value every time?
If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different way from LabelEncoder?
8 Answers
According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit the LabelEncoders at test time with data that have exactly the same set of unique values.
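A minimal sketch of why the set of unique values matters. As far as I can tell, LabelEncoder assigns each value its position in the sorted classes_ array rather than anything hash-based, so the codes are reproducible for the same set of values but shift when the set changes. The sample values below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, for illustration only.
train_values = ["paris", "berlin", "tokyo"]
test_values = ["tokyo", "paris"]  # "berlin" does not occur at test time

train_encoder = LabelEncoder().fit(train_values)
refit_encoder = LabelEncoder().fit(test_values)

# classes_ holds the sorted unique values; a value's code is its index there.
train_codes = train_encoder.transform(test_values).tolist()
refit_codes = refit_encoder.transform(test_values).tolist()

print(train_encoder.classes_.tolist(), train_codes)
print(refit_encoder.classes_.tolist(), refit_codes)
```

The encoder fitted on the training values and the one refitted on the test values disagree on the codes for the same strings, which is why the fitted encoder itself should be carried over to the test program.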
There's a somewhat hacky way to reuse the LabelEncoders you got during training. LabelEncoder has only one property, namely classes_. You can save it, and then restore it like this:
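Train:
import numpy
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)
Test:
import numpy
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ has object dtype (e.g. strings from pandas)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`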
This seems more efficient than refitting it using the same data.
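A small round-trip sketch of the same idea, assuming an object-dtype column (the array X and the temporary path are invented for illustration). It checks that an encoder rebuilt from the saved classes_ produces the same codes as the original one.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical object-dtype column, as pandas typically produces for strings.
X = np.array(["red", "green", "blue", "green"], dtype=object)

encoder = LabelEncoder().fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classes.npy")
    # Object arrays are pickled inside the .npy file.
    np.save(path, encoder.classes_)
    restored = LabelEncoder()
    # allow_pickle=True is required to load an object-dtype array.
    restored.classes_ = np.load(path, allow_pickle=True)

# Both encoders should map every value to the same integer code.
same_codes = np.array_equal(encoder.transform(X), restored.transform(X))
print(same_codes)
```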
For me, the easiest way was exporting the LabelEncoder as a .pkl file for each column. You have to export the encoder for each column after using the fit_transform() function.
For example:
from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# Export the Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)
Then, in the testing project, you can load the LabelEncoder object and apply the transform() function directly:
from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_test = pd.read_csv('testing_data.csv')
# Load the encoder file
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)
df_test['Departure'] = le_departure.transform(df_test['Departure'])
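With many categorical columns, one file per column can get unwieldy. A possible variant, sketched here with made-up data and column names, is to pickle the whole dictionary of encoders (like the mapper dict in the question) in a single step. pickle.dumps is used only to keep the sketch off the disk; pickle.dump to a file works the same way.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training frame and column list, for illustration only.
df_train = pd.DataFrame(
    {"Departure": ["LHR", "JFK", "LHR"], "Carrier": ["BA", "AA", "BA"]}
)
categorical_list = ["Departure", "Carrier"]

mapper = {}
for col in categorical_list:
    mapper[col] = LabelEncoder()
    df_train[col] = mapper[col].fit_transform(df_train[col])

# Serialize every fitted encoder at once.
payload = pickle.dumps(mapper)

# "Test" side: restore the dict and call transform() only, never fit again.
loaded_mapper = pickle.loads(payload)
df_test = pd.DataFrame({"Departure": ["JFK"], "Carrier": ["BA"]})
for col, encoder in loaded_mapper.items():
    df_test[col] = encoder.transform(df_test[col])

print(df_test)
```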
from sklearn.preprocessing import LabelEncoder
import joblib
import pandas as pd
df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# Save the encoder
joblib.dump(le, 'labelEncoder.joblib', compress=9)
# Load it at test time
le = joblib.load('labelEncoder.joblib')
What works for me is LabelEncoder().fit(X_train[col]), pickling these objects for each categorical column col, and then reusing the same objects for transforming the same categorical column col in the validation dataset. Basically, you have a label encoder object for each of your categorical columns.
- So fit() on the training data and pickle the objects/models corresponding to each column in the training dataframe X_train.
- For each col in the columns of the validation set X_cv, load the corresponding object/model and apply the transformation by calling its transform function: transform(X_cv[col]).
First we must assume that in your other program there are no new labels (unseen in the first).
As Osama Ayman mentioned above, and as stated in scikit-learn's documentation on model persistence, you may achieve what you want by serializing the label encoder to a local file through joblib.dump after you obtain it in the first program, and deserializing it via joblib.load.
Note that a serialized LabelEncoder object, whether written with pickle or with joblib (which builds on pickle), may not work as intended once your scikit-learn version changes (e.g., after an upgrade). To make sure that your saved label encoder behaves the same as when you created it, load it with the same scikit-learn version, or persist only the classes_ array as described in an earlier answer.
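The assumption that no new labels appear can be checked before calling transform(), which otherwise raises a ValueError on unseen values. A minimal sketch with invented labels; how to handle the unseen values (drop them, map them to a placeholder, refit) is left open here.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(["a", "b", "c"])  # hypothetical training labels
new_values = np.array(["a", "d", "c", "e"])    # hypothetical test labels

# Values absent from classes_ are the ones transform() would reject.
unseen = sorted(set(new_values.tolist()) - set(encoder.classes_.tolist()))

# Encode only the values the encoder already knows.
known_mask = np.isin(new_values, encoder.classes_)
codes = encoder.transform(new_values[known_mask]).tolist()

print(unseen, codes)
```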
As I found no other post about nominal/categorical encoding, I expand on the above-mentioned solutions and share mine for the OrdinalEncoder approach (which may have been what the author intended anyway).
I did the following with OrdinalEncoder (but it should work with LabelEncoder as well). Note that I am using categories_ instead of classes_.
- Create an Encoder dictionary
- Save it with numpy
- Load it with numpy
- Iterate over the dict and apply the transformation on each column
Note: np stands for numpy.
# ------- steps 1 and 2, in the file/cell where the encoding is exported
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
encoder_dict = dict()
for nom in nominal_columns:
    enc = OrdinalEncoder().fit(df[[nom]])  # a fresh encoder per column
    df[[nom]] = enc.transform(df[[nom]])
    encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]
np.save('FILE_NAME.npy', encoder_dict)
# ------- steps 3 and 4, in the file where the encoding is imported
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).tolist()
for nom in encoder_dict:
    if nom in df.columns:
        # Setting categories_ by hand relies on encoder internals and may
        # not be sufficient on every scikit-learn version.
        enc.categories_ = [np.array(cats, dtype=object) for cats in encoder_dict[nom]]
        df[[nom]] = enc.transform(df[[nom]])
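Assigning categories_ by hand depends on what the encoder checks internally. A variant that may be more robust, sketched here under the assumption that every value in the new data is already among the saved categories, is to pass the saved lists through the categories parameter of OrdinalEncoder and call fit on the new data. The category list and the array below are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical: one list of categories per column, as loaded from the export.
saved_categories = [["a", "b", "c"]]
X_new = np.array([["b"], ["a"]], dtype=object)

enc = OrdinalEncoder(categories=saved_categories)
# With explicit categories, fit() uses the given lists instead of learning
# them from X_new, and raises if X_new contains a value outside them.
encoded = enc.fit(X_new).transform(X_new)

print(encoded.tolist())
```

The value "c" never occurs in X_new, yet it keeps its slot, so "a" and "b" receive the same codes as at export time.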
If you're already saving your model via pickle, I would do the same for the pre-processing tools.
One way to do it would be combining everything into a class:
import pickle
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
class MyClassifier():
    def load_data(self):
        ...
    def fit(self):
        self.first_column_encoder = preprocessing.LabelEncoder()
        self.first_column_encoder.fit(...)
        ...
        self.second_column_encoder = preprocessing.LabelEncoder()
        self.second_column_encoder.fit(...)
        ...
        self.model = KNeighborsClassifier(...)
        self.model.fit(...)
my_classifier = MyClassifier()
my_classifier.fit()
# file: a binary file object opened for writing
pickle.dump(my_classifier, file)
Note: You may want to use OrdinalEncoder instead of LabelEncoder for input categories
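The same "keep everything in one object" idea can also be sketched with scikit-learn's own Pipeline, which bundles the encoder and the model so that a single pickle covers both. This is an alternative to the hand-written class, not the approach from the answer, and the toy data is invented for illustration.

```python
import pickle

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one categorical feature and a binary target.
X_train = np.array([["low"], ["low"], ["high"], ["high"]], dtype=object)
y_train = np.array([0, 0, 1, 1])

# The pipeline fits the encoder first, then the classifier on the encoded data.
pipeline = make_pipeline(OrdinalEncoder(), KNeighborsClassifier(n_neighbors=1))
pipeline.fit(X_train, y_train)

# One round trip through pickle restores the encoder and the model together.
restored = pickle.loads(pickle.dumps(pipeline))

X_test = np.array([["high"], ["low"]], dtype=object)
predictions = restored.predict(X_test).tolist()
print(predictions)
```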
You can do this after you have encoded the values with the "le" object:
encoding = {}
for i in list(le.classes_):
    encoding[i] = le.transform([i])[0]
You will get the "encoding" dictionary with the encoding for later use. With pandas, you can export this dictionary to a CSV file, for example.
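As a sketch of exporting such a dictionary, here is a JSON variant using only the standard library instead of pandas (the labels are made up). The keys and codes are cast to plain str and int because NumPy scalar types are not JSON serializable.

```python
import json

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["cat", "dog", "bird"])  # hypothetical labels

# Each class's code is its index in classes_; cast to built-in types for JSON.
encoding = {str(label): int(code) for code, label in enumerate(le.classes_)}

serialized = json.dumps(encoding, sort_keys=True)
restored = json.loads(serialized)

print(restored)
```

The restored dictionary can then be applied without scikit-learn, for instance with pandas' Series.map.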