Honestly, I can't consider any of the proposed approaches (except for applying generators) efficient enough. Here are my arguments:

- As clean_text(x) will be applied to each column cell, it's better to move the shared stopword sequence to the top level once. But that's the easy part: stopwords.words('english') is actually a list of stopwords, and it's much more efficient to convert it into a set object for a fast containment check (in if word.lower() in stopwords_english):

  stopwords_english = set(stopwords.words('english'))

- Instead of yielding words that aren't contained in the stopwords_english set and running replacements on them afterwards, words that are stopwords can simply be skipped at once:

  if word.lower() in stopwords_english: continue

- A subtle nuance: the pattern "/-'" of the first replacement attempt (for punct in "/-'") is entirely contained in the longer punctuation pattern '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’'. Thus the two can be unified into a single pattern, and since a word may contain multiple consecutive punctuation/special characters, I suggest applying a compiled regex pattern with a + quantifier (to replace whole runs of them at once), defined at the top level.
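To make the set-vs-list point concrete, here is a small timing sketch. The word list below is a made-up stand-in for stopwords.words('english'), just to keep the snippet self-contained; the relative difference is what matters, not the absolute numbers.

```python
import timeit

# Hypothetical stand-in for stopwords.words('english'): a list vs. its set form.
stopwords_list = ['the', 'a', 'an', 'and', 'or', 'but', 'if'] * 25  # ~175 entries
stopwords_set = set(stopwords_list)

# Membership test: O(n) linear scan for the list, O(1) hash lookup for the set.
t_list = timeit.timeit("'zebra' in stopwords_list", globals=globals(), number=100_000)
t_set = timeit.timeit("'zebra' in stopwords_set", globals=globals(), number=100_000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

The gap widens with the size of the stopword list and the number of cells processed, which is exactly the per-cell hot path here.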
Finally, the optimized approach would look as follows:

import re
...
stopwords_english = set(stopwords.words('english'))
punct_pat = re.compile(r'[?!.,"#$%\'()*+/:;<=>@\[\\\]^_`{|}~“”’-]+')

def clean_text(x):
    for word in x:
        if word.lower() in stopwords_english:
            continue
        if '&' in word:
            word = word.replace('&', ' & ')
        yield punct_pat.sub('', word)

Applied as:

train['question_text'].progress_apply(lambda x: list(clean_text(x)))

(Note the hyphen is placed last in the character class so it is matched literally rather than forming a range.)
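For a self-contained illustration, here is the approach run end to end on a toy DataFrame. The stopword set and the sample row are made up, and plain .apply is used instead of progress_apply so the sketch runs without tqdm's pandas integration.

```python
import re
import pandas as pd

# Minimal stand-in for the NLTK stopword set, so the sketch runs without nltk.
stopwords_english = {'is', 'the', 'a', 'of', 'what'}
# Hyphen last in the class so it is literal, not a range.
punct_pat = re.compile(r'[?!.,"#$%\'()*+/:;<=>@\[\\\]^_`{|}~“”’-]+')

def clean_text(x):
    for word in x:
        if word.lower() in stopwords_english:
            continue  # drop stopwords immediately
        if '&' in word:
            word = word.replace('&', ' & ')
        yield punct_pat.sub('', word)  # strip runs of punctuation in one pass

# Toy frame with one pre-tokenized cell; real code would use progress_apply.
train = pd.DataFrame({'question_text': [['What', 'is', "rock-'n'-roll", '&', 'blues?']]})
result = train['question_text'].apply(lambda x: list(clean_text(x)))
print(result[0])  # → ['rocknroll', ' & ', 'blues']
```

Note that '&' is deliberately absent from the punctuation class, so the padded ' & ' token survives the final substitution.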