Pandas replace rare values in a pipeline

Question 1

A common preprocessing step in machine learning consists of replacing rare values in the data with a placeholder label such as "rare", so that subsequent learning algorithms do not try to generalize from a value with few occurrences.
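As a rough illustration of the idea on a single column, here is a minimal sketch. The toy data, the threshold of 2, and the label "rare" are made up for the example and are not part of the class shown later.

```python
import pandas as pd

# Hypothetical toy column: "c" occurs only once.
colors = pd.Series(["a", "a", "b", "b", "b", "c"])

# Count how often each distinct value occurs.
counts = colors.value_counts()

# Look up the count for every row, then keep a value only if it is
# frequent enough; everything else gets the placeholder label.
min_occurrences = 2  # illustrative threshold
row_counts = colors.map(counts)
replaced = colors.where(row_counts >= min_occurrences, "rare")

print(replaced.tolist())  # ['a', 'a', 'b', 'b', 'b', 'rare']
```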

Pipelines make it possible to describe a sequence of preprocessing steps and learning algorithms as a single object that takes raw data, processes it, and outputs a prediction. scikit-learn expects each step to expose a specific interface (fit / transform or fit / predict). I wrote the following class to take care of this task so that it can be run inside a pipeline. (More details about the motivation can be found here: pandas replace rare values)

Is there a way to improve this code in terms of performance or reusability?

class RemoveScarceValuesFeatureEngineer:
    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"
        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

And the following can be appended to the above class to make sure the methods work as expected:

if __name__ == "__main__":
    import pandas as pd
    sample_train = pd.DataFrame(
        [{"a": 1, "s": "a"}, {"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    rssfe = RemoveScarceValuesFeatureEngineer(2)
    print(sample_train)
    print(rssfe.fit_transform(sample_train, None))
    print(20*"=")
    sample_test = pd.DataFrame([{"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    print(sample_test)
    print(rssfe.transform(sample_test))

Question 2

You could use scikit-learn's TransformerMixin, which provides an implementation of fit_transform for you (its implementation is available here for interest).
I'd consider renaming RemoveScarceValuesFeatureEngineer to something that fits better with the other classes in scikit-learn. How about RareValueTransformer instead?
What do you want to happen if an unseen value is transformed? Take, for example:

sample_test = pd.DataFrame([{"a": 1, "s": "a"}, {"a": 2, "s": "b"}])
print(sample_test)
print(rssfe.transform(sample_test))

This raises a KeyError, which isn't what I expected. I'd either rework your code to ignore unseen values, or return a nicer error if this is what you want to happen. To me, ignoring seems more reasonable, but it's up to you! Making some unit tests would give you more confidence in cases like this, too.
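To make these suggestions concrete, here is a minimal sketch of how they could fit together. It assumes that unseen values should be left untouched and that transform should not modify its input; the `rare_label` parameter and the `rare_values_` attribute are names introduced for this illustration, not anything from the original code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RareValueTransformer(TransformerMixin, BaseEstimator):
    """Replace values that were rare in the training data with a label."""

    def __init__(self, min_occurrences=2, rare_label="RARE_VALUE"):
        self.min_occurrences = min_occurrences
        self.rare_label = rare_label  # hypothetical parameter

    def fit(self, X, y=None):
        # For each column, remember only the values that were rare at fit time.
        self.rare_values_ = {}
        for column in X.columns:
            counts = X[column].value_counts()
            rare = counts[counts < self.min_occurrences].index
            self.rare_values_[column] = list(rare)
        return self

    def transform(self, X):
        X = X.copy()  # do not mutate the caller's DataFrame
        for column, rare in self.rare_values_.items():
            mask = X[column].isin(rare)
            if mask.any():
                # Cast to object so the string label also fits numeric columns.
                X[column] = X[column].astype(object).where(~mask, self.rare_label)
        return X


sample_train = pd.DataFrame(
    [{"a": 1, "s": "a"}, {"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
sample_test = pd.DataFrame([{"a": 1, "s": "a"}, {"a": 2, "s": "b"}])

transformer = RareValueTransformer(min_occurrences=2)
# fit_transform is inherited from TransformerMixin.
print(transformer.fit_transform(sample_train))
# The unseen value 2 in column "a" passes through unchanged, no KeyError.
print(transformer.transform(sample_test))
```

Because only membership in the stored rare values is checked, a value that never appeared during fit is simply kept. If you would rather treat unseen values as rare, you could store the frequent values instead and replace everything else.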

A pedantic aside: you have a typo of 'occurrence' in min_occurences, which is easily amended.

htl · Answer 1 · 2021-04-04 15:57:13Z

