A common preprocessing step in machine learning consists of replacing rare values in the data with a label stating "rare", so that subsequent learning algorithms do not try to generalize from a value with few occurrences.
Pipelines make it possible to describe a sequence of preprocessing and learning steps as a single object that takes raw data, processes it, and outputs a prediction. scikit-learn expects each step to follow a specific interface (fit / transform or fit / predict). I wrote the following class to take care of this task so that it can be run inside a pipeline. (More details about the motivation can be found here: pandas replace rare values)
Is there a way to improve this code in terms of performance or reusability?
class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"
        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
And the following can be appended to the above class to make sure the methods work as expected:
if __name__ == "__main__":
    import pandas as pd
    sample_train = pd.DataFrame(
        [{"a": 1, "s": "a"}, {"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    rssfe = RemoveScarceValuesFeatureEngineer(2)
    print(sample_train)
    print(rssfe.fit_transform(sample_train, None))
    print(20*"=")
    sample_test = pd.DataFrame([{"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    print(sample_test)
    print(rssfe.transform(sample_test))
1 Answer
You could use scikit-learn's TransformerMixin, which provides an implementation of fit_transform for you (its implementation is available here for interest).
I'd consider renaming RemoveScarceValuesFeatureEngineer to something that fits a bit more with other classes in scikit-learn. How about RareValueTransformer instead?
What do you want to happen if an unseen value is transformed? Take, for example:
sample_test = pd.DataFrame([{"a": 1, "s": "a"}, {"a": 2, "s": "b"}])
print(sample_test)
print(rssfe.transform(sample_test))
This raises a KeyError, which isn't what I expected. I'd either rework your code to ignore unseen values, or raise a clearer error if that is what you want to happen. To me, ignoring seems more reasonable, but it's up to you! Writing some unit tests would give you more confidence in cases like this, too.
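One possible way to get the "ignore unseen values" behaviour is to look the counts up with Series.map rather than by indexing, since map yields NaN for values that were not seen during fitting and a comparison against NaN is False. The sketch below is only an illustration: replace_rare and RARE_LABEL are hypothetical names that are not in your code, and the two test functions are written in a pytest style as an example of the unit tests mentioned above.

```python
import pandas as pd

RARE_LABEL = "RARE_VALUE"  # same placeholder string as in the question


def replace_rare(column, counts, min_occurrences):
    """Return a copy of `column` with rare values replaced by RARE_LABEL.

    `counts` is the value_counts() Series computed on the training data.
    Values missing from `counts` map to NaN, and NaN < n is False, so
    unseen values are left unchanged instead of raising a KeyError.
    """
    seen_counts = column.map(counts)
    is_rare = seen_counts < min_occurrences
    return column.mask(is_rare, RARE_LABEL)


def test_rare_value_is_replaced():
    train = pd.Series(["a", "a", "b"])
    result = replace_rare(train, train.value_counts(), min_occurrences=2)
    assert result.tolist() == ["a", "a", RARE_LABEL]


def test_unseen_value_is_left_unchanged():
    counts = pd.Series([1, 1, 1]).value_counts()
    result = replace_rare(pd.Series([1, 2]), counts, min_occurrences=2)
    assert result.tolist() == [1, 2]
```

If you would rather fail loudly, the same NaN mask (seen_counts.isna()) could be used to raise a ValueError that names the unseen values.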
A pedantic aside: you have a typo of 'occurrence' in min_occurences, which is easily amended.
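Putting these suggestions together, a rough sketch of the class might look like the following. Several details are my own assumptions rather than anything in your code: the rare_label parameter, the default of 2 for min_occurrences, the value_counts_ attribute name, also inheriting from BaseEstimator, and working on a copy of X instead of modifying it in place.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RareValueTransformer(TransformerMixin, BaseEstimator):
    """Sketch: replace values seen fewer than `min_occurrences` times."""

    def __init__(self, min_occurrences=2, rare_label="RARE_VALUE"):
        # Assumed parameters; stored under their own names so that
        # BaseEstimator.get_params() can find them.
        self.min_occurrences = min_occurrences
        self.rare_label = rare_label

    def fit(self, X, y=None):
        # Remember how often each value occurs in each training column.
        self.value_counts_ = {col: X[col].value_counts() for col in X.columns}
        return self

    def transform(self, X):
        X = X.copy()  # assumption: leave the caller's DataFrame untouched
        for col, counts in self.value_counts_.items():
            # Unseen values map to NaN, which never compares as rare.
            is_rare = X[col].map(counts) < self.min_occurrences
            X[col] = X[col].mask(is_rare, self.rare_label)
        return X


train = pd.DataFrame({"a": [1, 1, 1], "s": ["a", "a", "b"]})
test = pd.DataFrame({"a": [1, 2], "s": ["a", "b"]})

transformer = RareValueTransformer(min_occurrences=2)
print(transformer.fit_transform(train))  # fit_transform comes from TransformerMixin
print(transformer.transform(test))       # the unseen 2 in column "a" is kept as-is
```

Note that no fit_transform is defined here; the mixin supplies it by calling fit and then transform.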