I wrote a function that I use to preprocess pandas dataframes before running them through a machine learning model. The function works perfectly; however, I don't think it's written in the most Pythonic way.
This function accepts a list of words:
['here', 'is', 'a', 'sample', 'of', 'what', 'the', 'function', 'accepts']
def clean_text(x):
    stopwords_english = stopwords.words('english')
    for i, word in enumerate(x):
        if word.lower() in stopwords_english:
            x[i] = ''
        else:
            for punct in "/-'":
                x[i] = word.replace(punct, ' ')
            for punct in '&':
                x[i] = word.replace(punct, f' {punct} ')
            for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’':
                x[i] = word.replace(punct, '')
    return x
Here I am using enumerate to change the values inside the list. I would have assumed a more Pythonic way of doing it would be to write it as follows:
def clean_text(x):
    stopwords_english = stopwords.words('english')
    for word in x:
        if word.lower() in stopwords_english:
            word = ''
        else:
            for punct in "/-'":
                word = word.replace(punct, ' ')
            for punct in '&':
                word = word.replace(punct, f' {punct} ')
            for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’':
                word = word.replace(punct, '')
    return x
The function is being called as follows:
train['question_text'].progress_apply(lambda x: clean_text(x))
Where train is a pandas dataframe and 'question_text' is a column in the dataframe.
Is my current implementation the most pythonic way, or is there a better way?
2 Answers
This may be a good case for generator functions. Splitting it into two parts might also make things more flexible: first remove the stopwords, then handle the punctuation. Also, str.maketrans and str.translate can do the punctuation mapping.
def remove_stopwords(text_iterable, stopwords):
    for word in text_iterable:
        if word.lower() not in stopwords:
            yield word

def handle_punctuation(text_iterable, table):
    for word in text_iterable:
        yield word.translate(table)
# format is ("chars", "replacement")
mappings = (("/-'", ' '),
            ('&', ' & '),
            ('?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’', None))

table = str.maketrans({c: v for k, v in mappings for c in k})

stopword_free = remove_stopwords(text, stopwords)
cleaned_text = handle_punctuation(stopword_free, table)
cleaned_text is a generator; use list(handle_punctuation(stopword_free, table)) if you need an actual list of words.
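As a quick illustration, here is a hypothetical end-to-end run, assuming NLTK's English stopword list, the table from the snippet above, and the sample input from the question:

from nltk.corpus import stopwords as nltk_stopwords

# hypothetical setup: a stopword set plus the sample list from the question
stopwords = set(nltk_stopwords.words('english'))
text = ['here', 'is', 'a', 'sample', 'of', 'what', 'the', 'function', 'accepts']

stopword_free = remove_stopwords(text, stopwords)
print(list(handle_punctuation(stopword_free, table)))
# -> ['sample', 'function', 'accepts']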
- Thanks for the answer; is there a specific advantage to using a generator in this case? I am passing the function as a lambda as follows: train['question_text'].progress_apply(lambda x: list(remove_stopwords(x))) – Abed Merii, Dec 9, 2019 at 19:28
- I find generator functions work well as filters. Small ones are easy to understand (and code), and they can be composed to make more complicated filters. For example, if you wanted to also remove words containing foreign letters, it would be relatively easy to add another filter generator to the chain (see the sketch after these comments). – RootTwo, Dec 9, 2019 at 20:32
- It's far easier to manipulate generator code because you avoid hard-coding your values; also, if you need to change something in your code in the future, you only have to change it in one place. – Barb, Dec 10, 2019 at 8:08
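To illustrate the composability RootTwo describes, here is a hypothetical extra filter (using str.isascii as a stand-in check for "foreign letters") that slots into the chain without touching the existing functions:

def remove_non_ascii(text_iterable):
    # same filter style: keep only words made of ASCII characters
    for word in text_iterable:
        if word.isascii():
            yield word

stopword_free = remove_stopwords(text, stopwords)
ascii_only = remove_non_ascii(stopword_free)
cleaned_text = handle_punctuation(ascii_only, table)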
Honestly, I cannot consider either of the proposed approaches (except for applying generators) efficient enough.
Here are my arguments:
- As clean_text(x) will be applied to each column cell, it's better to move the common stopwords.words('english') sequence to the top level at once. But that's easy. More importantly, stopwords.words('english') is actually a list of stopwords; it is much more efficient to convert it into a set object for fast containment checks (in if word.lower() in stopwords_english):

stopwords_english = set(stopwords.words('english'))

- Instead of yielding the words that aren't contained in the stopwords_english set for further replacements, do the opposite: words that are stopwords can just be skipped at once:

if word.lower() in stopwords_english:
    continue

- A subtle nuance: the pattern "/-'" in the 1st replacement attempt (for punct in "/-'") is actually contained in the longer pattern of punctuation chars '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’'. Thus, the two can be unified into a single pattern, and considering that there could be multiple consecutive occurrences of punctuation/special chars within a word, I suggest applying a compiled regex pattern with the + quantifier (to replace multiple occurrences at once), defined at the top level.
Finally, the optimized approach would look as follows:
import re
...

stopwords_english = set(stopwords.words('english'))
punct_pat = re.compile(r'[?!.,"#$%\'()*+-/:;<=>@\[\\\]^_`{|}~""’]+')

def clean_text(x):
    for word in x:
        if word.lower() in stopwords_english:
            continue
        if '&' in word:
            word = word.replace('&', ' & ')
        yield punct_pat.sub('', word)
Applied as:
train['question_text'].progress_apply(lambda x: list(clean_text(x)))
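Note that progress_apply is not a built-in pandas method; it comes from tqdm and must be registered first. A minimal setup sketch, assuming the standard tqdm/pandas integration:

from tqdm import tqdm

tqdm.pandas()  # adds .progress_apply() to pandas Series/DataFrame objects
train['question_text'] = train['question_text'].progress_apply(lambda x: list(clean_text(x)))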
- How about one more step to make it use a list comprehension? Something like punct_pat.sub('', word.replace('&', ' & ')) for word in x if word.lower() not in stopwords_english – JollyJoker, Dec 10, 2019 at 12:16
- @JollyJoker, for simple cases, and if the function is unlikely to be extended, a list comprehension could be straightforward and good. But I've used a single for loop to show how the flow goes; it's easier to extend with potential additional transformations/substitutions without loss of readability. – RomanPerekhrest, Dec 10, 2019 at 13:34
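For reference, here is a minimal sketch of the list-comprehension variant JollyJoker suggests, reusing the stopwords_english set and punct_pat regex defined above (note that it returns a list directly rather than a generator):

def clean_text(x):
    # one pass: drop stopwords, space out '&', strip punctuation runs
    return [punct_pat.sub('', word.replace('&', ' & '))
            for word in x
            if word.lower() not in stopwords_english]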
- How is clean_text applied to the pandas.Series sequence? Post the context of the call.
- If the train['question_text'] column contains a list of words in each cell, imagine that after replacement the resulting list could have multiple gaps like ['', 'is', 'a', '', 'of', 'what', 'the', '', ''] - is that expected in your case, or could the result be returned as plain text?
- You could replace the plain '&' substitution with a regex that looks for an ampersand that isn't surrounded by spaces.
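One hypothetical way to read that last suggestion: a compiled pattern that matches an ampersand together with any adjacent whitespace, so every '&' ends up with exactly one space on each side and already-spaced ampersands are left alone:

import re

amp_pat = re.compile(r'\s*&\s*')  # '&' plus any surrounding spaces

def space_ampersand(word):
    # 'AT&T' -> 'AT & T'; 'A & B' stays 'A & B'
    return amp_pat.sub(' & ', word)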