Honestly, I can't consider any of the proposed approaches (except for applying generators) efficient enough. Here are my arguments:

- As clean_text(x) will be applied to each column cell, it's better to move the shared stopword sequence to the top level once. But that's the easy part: stopwords.words('english') is actually a list of stopwords, and it's much more efficient to convert it into a set object for a fast containment check (in if word.lower() in stopwords_english):

  stopwords_english = set(stopwords.words('english'))

- Instead of yielding words that aren't contained in the stopwords_english set and running replacements on them afterwards, words that are stopwords can simply be skipped at once:

  if word.lower() in stopwords_english: continue

- A subtle nuance: the pattern "/-'" of the first replacement attempt (for punct in "/-'") is entirely contained in the longer punctuation pattern '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’'. Thus the two can be unified into a single pattern, and since a word may contain multiple consecutive punctuation/special characters, I suggest applying a compiled regex pattern with a + quantifier (to replace whole runs of them at once), defined at the top level.
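To make the set-vs-list point concrete, here is a small timing sketch. The word list below is a made-up stand-in for stopwords.words('english'), just to keep the snippet self-contained; the relative difference is what matters, not the absolute numbers.

```python
import timeit

# Hypothetical stand-in for stopwords.words('english'): a list vs. its set form.
stopwords_list = ['the', 'a', 'an', 'and', 'or', 'but', 'if'] * 25  # ~175 entries
stopwords_set = set(stopwords_list)

# Membership test: O(n) linear scan for the list, O(1) hash lookup for the set.
t_list = timeit.timeit("'zebra' in stopwords_list", globals=globals(), number=100_000)
t_set = timeit.timeit("'zebra' in stopwords_set", globals=globals(), number=100_000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

The gap widens with the size of the stopword list and the number of cells processed, which is exactly the per-cell hot path here.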
Finally, the optimized approach would look as follows:

import re
...
stopwords_english = set(stopwords.words('english'))
punct_pat = re.compile(r'[?!.,"#$%\'()*+/:;<=>@\[\\\]^_`{|}~“”’-]+')

def clean_text(x):
    for word in x:
        if word.lower() in stopwords_english:
            continue
        if '&' in word:
            word = word.replace('&', ' & ')
        yield punct_pat.sub('', word)

Applied as:

train['question_text'].progress_apply(lambda x: list(clean_text(x)))

(Note the hyphen is placed last in the character class so it is matched literally rather than forming a range.)
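For a self-contained illustration, here is the approach run end to end on a toy DataFrame. The stopword set and the sample row are made up, and plain .apply is used instead of progress_apply so the sketch runs without tqdm's pandas integration.

```python
import re
import pandas as pd

# Minimal stand-in for the NLTK stopword set, so the sketch runs without nltk.
stopwords_english = {'is', 'the', 'a', 'of', 'what'}
# Hyphen last in the class so it is literal, not a range.
punct_pat = re.compile(r'[?!.,"#$%\'()*+/:;<=>@\[\\\]^_`{|}~“”’-]+')

def clean_text(x):
    for word in x:
        if word.lower() in stopwords_english:
            continue  # drop stopwords immediately
        if '&' in word:
            word = word.replace('&', ' & ')
        yield punct_pat.sub('', word)  # strip runs of punctuation in one pass

# Toy frame with one pre-tokenized cell; real code would use progress_apply.
train = pd.DataFrame({'question_text': [['What', 'is', "rock-'n'-roll", '&', 'blues?']]})
result = train['question_text'].apply(lambda x: list(clean_text(x)))
print(result[0])  # → ['rocknroll', ' & ', 'blues']
```

Note that '&' is deliberately absent from the punctuation class, so the padded ' & ' token survives the final substitution.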