Honestly, I cannot consider any of the proposed approaches (except for applying generators) efficient enough.

Here are my arguments:

  • stopwords.words('english') sequence.
    Since clean_text(x) will be applied to each column cell, it's better to build the common sequence of stopwords once at the top level. More importantly, stopwords.words('english') is actually a list of stopwords, and it is much more efficient to convert it into a set object for fast containment checks (in if word.lower() in stopwords_english); see the timing sketch after this list:

     stopwords_english = set(stopwords.words('english'))
    
  • instead of testing that a word is not contained in the stopwords_english set before doing the further replacements, invert the condition: words that are stopwords can simply be skipped right away:

     if word.lower() in stopwords_english:
         continue
    
  • a subtle nuance: the character set "/-'" from the first replacement pass (for punct in "/-'") is entirely contained in the longer set of punctuation characters ?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '""’'.
    The two passes can therefore be unified into a single pattern, and since a word may contain multiple consecutive punctuation/special characters, I suggest applying a compiled regex pattern with the + quantifier (to replace a whole run at once), defined at the top level; see the regex demo after this list.

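To back the first point with numbers, here is a minimal timing sketch (it assumes the NLTK stopwords corpus has already been downloaded via nltk.download('stopwords'); the probe word and iteration count are arbitrary):

import timeit
from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')   # plain list: O(n) membership test
stopwords_set = set(stopwords_list)           # set: O(1) average membership test

# probe with a non-stopword, i.e. the worst case for the list scan
print(timeit.timeit("'zygote' in stopwords_list", globals=globals(), number=100000))
print(timeit.timeit("'zygote' in stopwords_set", globals=globals(), number=100000))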

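And a small demonstration of the third point: thanks to the + quantifier, a whole run of consecutive punctuation characters is removed in a single substitution (the sample strings are just illustrative):

import re

punct_pat = re.compile(r'[?!.,"#$%\'()*+-/:;<=>@\[\\\]^_`{|}~""’]+')

print(punct_pat.sub('', 'word?!...end'))   # -> 'wordend'
print(punct_pat.sub('', "it's-a/test"))    # -> 'itsatest'

Note that +-/ inside the character class is read as the range from '+' to '/', which covers exactly the characters + , - . /, so the intended punctuation is still matched.
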
Finally, the optimized approach would look as follows:

import re
from nltk.corpus import stopwords   # stopwords.words() comes from NLTK's corpus module
...
stopwords_english = set(stopwords.words('english'))   # set: O(1) average containment checks
punct_pat = re.compile(r'[?!.,"#$%\'()*+-/:;<=>@\[\\\]^_`{|}~""’]+')  # one pattern, '+' collapses runs

def clean_text(x):
    for word in x:
        if word.lower() in stopwords_english:
            continue                                   # drop stopwords immediately
        if '&' in word:
            word = word.replace('&', ' & ')            # surround '&' with spaces
        yield punct_pat.sub('', word)                  # strip punctuation/special chars

Applied as:

train['question_text'].progress_apply(lambda x: list(clean_text(x)))

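For completeness, a self-contained usage sketch on a toy DataFrame (the two sample rows and the tqdm.pandas() call are mine, not from the original question; plain .apply works the same way, just without the progress bar), assuming the definitions above are in scope:

import pandas as pd
from tqdm import tqdm

tqdm.pandas()   # registers Series.progress_apply

train = pd.DataFrame({
    'question_text': [
        ['What', 'is', 'the', 'meaning', 'of', 'life?'],
        ['AT&T', 'or', 'T-Mobile', '-', 'which', 'is', 'better?'],
    ],
})

train['question_text'] = train['question_text'].progress_apply(lambda x: list(clean_text(x)))
print(train['question_text'].tolist())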