It is, of course, important to determine the performance bottleneck(s) by profiling the code, but here are some observations to scratch the surface.
You can cut down on the number of times you iterate over the words by filtering in a single loop, e.g. in the case when the lemmatizer is "falsy":
import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def preprocess_text_new(text, ps):
    """
    Lowercase, tokenize, remove stop words and stem.
    Returns a string of space-separated tokens.
    """
    words = re.sub(r"[^a-zA-Z]", " ", text.lower())
    words = word_tokenize(words)

    stops = set(stopwords.words("english"))

    result = []
    for word in words:
        if word in stops:  # skip stop words
            continue

        stemmed = ps.stem(word)
        if len(stemmed) > 1:
            result.append(stemmed)
    return " ".join(result)
Note the use of the much faster str.join()
instead of continuously concatenating a string.
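As a quick sanity check, the function can be exercised roughly like this (assuming the NLTK punkt and stopwords data are already downloaded; the sample sentence is arbitrary):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(preprocess_text_new("The striped bats were hanging on their feet.", ps))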
If you are executing the function multiple times, you should not redo the things you've already done: e.g. the stopwords set can be defined prior to the function execution, and the regular expression can be pre-compiled.
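A minimal sketch of hoisting that work to module level (the constant names are illustrative):

import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Computed once at import time instead of on every call
NON_LETTERS_RE = re.compile(r"[^a-zA-Z]")
STOPS = frozenset(stopwords.words("english"))


def preprocess_text_new(text, ps):
    words = word_tokenize(NON_LETTERS_RE.sub(" ", text.lower()))

    result = []
    for word in words:
        if word in STOPS:  # skip stop words
            continue

        stemmed = ps.stem(word)
        if len(stemmed) > 1:
            result.append(stemmed)
    return " ".join(result)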
From what I can conclude after profiling the code without lemmatizing, stemming is the largest contributor to the overall execution time; it is costly. If it is at all possible, you can optimize things by caching the words that were already stemmed:
result = []
cache = {}
for word in words:
    # ...
    if word not in cache:
        stemmed = ps.stem(word)
        cache[word] = stemmed
    else:
        stemmed = cache[word]
    result.append(stemmed)
Or, you can pre-compute stems for the most popular words in the corpus you are working with (it is difficult to tell how effective it would be, but please do experiment).
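For illustration only, a rough sketch of that pre-computation (the function names, corpus_tokens argument and the top_n cut-off are hypothetical placeholders):

from collections import Counter


def build_stem_table(corpus_tokens, ps, top_n=10_000):
    """Pre-compute stems for the top_n most frequent words in the corpus."""
    counts = Counter(corpus_tokens)
    return {word: ps.stem(word) for word, _ in counts.most_common(top_n)}


def stem_with_table(word, ps, stem_table):
    # Fall back to the stemmer for words outside the pre-computed table
    return stem_table[word] if word in stem_table else ps.stem(word)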
The same pre-computation and memoization idea would also work for the lemmatization part.
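For example, a memoized wrapper could look roughly like this (assuming the WordNet lemmatizer; adapt it to whichever lemmatizer you actually use):

from functools import lru_cache

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


@lru_cache(maxsize=None)
def lemmatize_cached(word):
    # Repeated words hit the in-memory cache instead of WordNet
    return lemmatizer.lemmatize(word)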
Also, since the PyPy interpreter supports nltk, check if using it can provide a performance boost.