I wrote a function that I use to preprocess pandas dataframes before running them through a machine learning model. The function works perfectly; however, I don't think it's written in the most Pythonic way.
This function accepts a list of words:
['here', 'is', 'a', 'sample', 'of', 'what', 'the', 'function', 'accepts']
def clean_text(x):
    stopwords_english = stopwords.words('english')
    for i, word in enumerate(x):
        if word.lower() in stopwords_english:
            x[i] = ''
        else:
            for punct in "/-'":
                x[i] = word.replace(punct, ' ')
            for punct in '&':
                x[i] = word.replace(punct, f' {punct} ')
            for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’':
                x[i] = word.replace(punct, '')
    return x
Here I am using enumerate to change the values inside the list. I would have assumed a more Pythonic way of doing it would be to write it as follows:
def clean_text(x):
    stopwords_english = stopwords.words('english')
    for word in x:
        if word.lower() in stopwords_english:
            word = ''
        else:
            for punct in "/-'":
                word = word.replace(punct, ' ')
            for punct in '&':
                word = word.replace(punct, f' {punct} ')
            for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’':
                word = word.replace(punct, '')
    return x
The function is being called as follows:
train['question_text'].progress_apply(lambda x: clean_text(x))
Where train is a pandas dataframe and 'question_text' is a column in the dataframe.
Is my current implementation the most pythonic way, or is there a better way?
2 Answers
This may be a good case for generator functions. Splitting it into two parts might also make things more flexible: first remove the stopwords, then handle the punctuation. Also, str.maketrans and str.translate can do the punctuation mapping.
def remove_stopwords(text_iterable, stopwords):
    for word in text_iterable:
        if word.lower() not in stopwords:
            yield word

def handle_punctuation(text_iterable, table):
    for word in text_iterable:
        yield word.translate(table)
# format is ("chars", "replacement")
mappings = (("/-'", ' '),
            ('&', ' & '),
            ('?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’', None))

table = str.maketrans({c: v for k, v in mappings for c in k})

stopword_free = remove_stopwords(text, stopwords)
cleaned_text = handle_punctuation(stopword_free, table)
cleaned_text is a generator; use list(handle_punctuation(stopword_free, table)) if you need an actual list of words.
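As a quick illustration, here is a hypothetical end-to-end run, assuming NLTK's English stopword list, the table from the snippet above, and the sample input from the question:

from nltk.corpus import stopwords as nltk_stopwords

# hypothetical setup: a stopword set plus the sample list from the question
stopwords = set(nltk_stopwords.words('english'))
text = ['here', 'is', 'a', 'sample', 'of', 'what', 'the', 'function', 'accepts']

stopword_free = remove_stopwords(text, stopwords)
print(list(handle_punctuation(stopword_free, table)))
# -> ['sample', 'function', 'accepts']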
- Thanks for the answer; is there a specific advantage to using a generator in this case? I am passing the function as a lambda as follows: train['question_text'].progress_apply(lambda x: list(remove_stopwords(x))) – Abed Merii, Dec 9, 2019 at 19:28
- I find generator functions work well as filters. Small ones are easy to understand (and code), and they can be composed to make more complicated filters. For example, if you wanted to also remove words containing foreign letters, it would be relatively easy to add another filter generator to the chain (see the sketch after these comments). – RootTwo, Dec 9, 2019 at 20:32
- It's far easier to manipulate generator code because you avoid hard-coding your values; also, if you need to change something in your code in the future, you only have to change it in one place. – Barb, Dec 10, 2019 at 8:08
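To illustrate the composability RootTwo describes, here is a hypothetical extra filter (using str.isascii as a stand-in check for "foreign letters") that slots into the chain without touching the existing functions:

def remove_non_ascii(text_iterable):
    # same filter style: keep only words made of ASCII characters
    for word in text_iterable:
        if word.isascii():
            yield word

stopword_free = remove_stopwords(text, stopwords)
ascii_only = remove_non_ascii(stopword_free)
cleaned_text = handle_punctuation(ascii_only, table)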
Honestly, I cannot consider either of the proposed approaches (except for applying generators) efficient enough.
Here are my arguments:
- As clean_text(x) will be applied to each column cell, it's better to move the common stopwords.words('english') sequence to the top level at once. But that's easy. More importantly, stopwords.words('english') is actually a list of stopwords; it is much more efficient to convert it into a set object for fast containment checks (in if word.lower() in stopwords_english):

stopwords_english = set(stopwords.words('english'))

- Instead of yielding the words that aren't contained in the stopwords_english set for further replacements, do the opposite: words that are stopwords can just be skipped at once:

if word.lower() in stopwords_english:
    continue

- A subtle nuance: the pattern "/-'" in the 1st replacement attempt (for punct in "/-'") is actually contained in the longer pattern of punctuation chars '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’'. Thus, the two can be unified into a single pattern, and considering that there could be multiple consecutive occurrences of punctuation/special chars within a word, I suggest applying a compiled regex pattern with the + quantifier (to replace multiple occurrences at once), defined at the top level.
Finally, the optimized approach would look as follows:
import re
...

stopwords_english = set(stopwords.words('english'))
punct_pat = re.compile(r'[?!.,"#$%\'()*+-/:;<=>@\[\\\]^_`{|}~""’]+')

def clean_text(x):
    for word in x:
        if word.lower() in stopwords_english:
            continue
        if '&' in word:
            word = word.replace('&', ' & ')
        yield punct_pat.sub('', word)
Applied as:
train['question_text'].progress_apply(lambda x: list(clean_text(x)))
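Note that progress_apply is not a built-in pandas method; it comes from tqdm and must be registered first. A minimal setup sketch, assuming the standard tqdm/pandas integration:

from tqdm import tqdm

tqdm.pandas()  # adds .progress_apply() to pandas Series/DataFrame objects
train['question_text'] = train['question_text'].progress_apply(lambda x: list(clean_text(x)))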
- How about one more step to make it use a list comprehension? Something like punct_pat.sub('', word.replace('&', ' & ')) for word in x if word.lower() not in stopwords_english – JollyJoker, Dec 10, 2019 at 12:16
- @JollyJoker, for simple cases, and if the function is unlikely to be extended, a list comprehension could be straightforward and good. But I've used a single for loop to show how the flow goes; it's easier to extend with potential additional transformations/substitutions without loss of readability. – RomanPerekhrest, Dec 10, 2019 at 13:34
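For reference, here is a minimal sketch of the list-comprehension variant JollyJoker suggests, reusing the stopwords_english set and punct_pat regex defined above (note that it returns a list directly rather than a generator):

def clean_text(x):
    # one pass: drop stopwords, space out '&', strip punctuation runs
    return [punct_pat.sub('', word.replace('&', ' & '))
            for word in x
            if word.lower() not in stopwords_english]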
- How is clean_text applied to the pandas.Series sequence? Post the context of the call.
- If the train['question_text'] column contains a list of words in each cell, imagine that after replacement the resulting list could have multiple gaps like ['', 'is', 'a', '', 'of', 'what', 'the', '', ''] - is that expected in your case, or could the result be returned as plain text?
- You could replace the plain '&' substitution with a regex that looks for an ampersand that isn't surrounded by spaces.
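One hypothetical way to read that last suggestion: a compiled pattern that matches an ampersand together with any adjacent whitespace, so every '&' ends up with exactly one space on each side and already-spaced ampersands are left alone:

import re

amp_pat = re.compile(r'\s*&\s*')  # '&' plus any surrounding spaces

def space_ampersand(word):
    # 'AT&T' -> 'AT & T'; 'A & B' stays 'A & B'
    return amp_pat.sub(' & ', word)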