It is, of course, important to determine the performance bottleneck(s) by profiling the code, but here are some observations to scratch the surface.
You can cut down on the number of times you iterate over the words by filtering in a single loop, e.g. in the case when the lemmatizer is "falsy":
import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def preprocess_text_new(text, ps):
    """
    Lowercase, tokenize, remove stop words and stem.
    Returns a string of space-separated tokens.
    """
    words = re.sub(r"[^a-zA-Z]", " ", text.lower())
    words = word_tokenize(words)

    stops = set(stopwords.words("english"))

    result = []
    for word in words:
        if word in stops:  # skip stop words
            continue

        stemmed = ps.stem(word)
        if len(stemmed) > 1:
            result.append(stemmed)
    return " ".join(result)
Note the use of the much faster str.join()
instead of continuously concatenating a string.
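As a quick sanity check, the function can be exercised roughly like this (assuming the NLTK punkt and stopwords data are already downloaded; the sample sentence is arbitrary):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(preprocess_text_new("The striped bats were hanging on their feet.", ps))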
If you are executing the function multiple times, you should not redo the things you've already done: e.g. the stopwords set can be defined prior to the function execution, and the regular expression can be pre-compiled.
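A minimal sketch of hoisting that work to module level (the constant names are illustrative):

import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Computed once at import time instead of on every call
NON_LETTERS_RE = re.compile(r"[^a-zA-Z]")
STOPS = frozenset(stopwords.words("english"))


def preprocess_text_new(text, ps):
    words = word_tokenize(NON_LETTERS_RE.sub(" ", text.lower()))

    result = []
    for word in words:
        if word in STOPS:  # skip stop words
            continue

        stemmed = ps.stem(word)
        if len(stemmed) > 1:
            result.append(stemmed)
    return " ".join(result)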
From what I can conclude after profiling the code without lemmatizing, stemming is the largest contributor to the overall execution time; it is costly. If it is at all possible, you can optimize things by caching the words that were already stemmed:
result = []
cache = {}
for word in words:
    # ...
    if word not in cache:
        stemmed = ps.stem(word)
        cache[word] = stemmed
    else:
        stemmed = cache[word]
    result.append(stemmed)
Or, you can pre-compute stems for the most popular words in the corpus you are working with (it is difficult to tell how effective it would be, but please do experiment).
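For illustration only, a rough sketch of that pre-computation (the function names, corpus_tokens argument and the top_n cut-off are hypothetical placeholders):

from collections import Counter


def build_stem_table(corpus_tokens, ps, top_n=10_000):
    """Pre-compute stems for the top_n most frequent words in the corpus."""
    counts = Counter(corpus_tokens)
    return {word: ps.stem(word) for word, _ in counts.most_common(top_n)}


def stem_with_table(word, ps, stem_table):
    # Fall back to the stemmer for words outside the pre-computed table
    return stem_table[word] if word in stem_table else ps.stem(word)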
The same pre-computation and memoization idea would also work for the lemmatization part.
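For example, a memoized wrapper could look roughly like this (assuming the WordNet lemmatizer; adapt it to whichever lemmatizer you actually use):

from functools import lru_cache

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


@lru_cache(maxsize=None)
def lemmatize_cached(word):
    # Repeated words hit the in-memory cache instead of WordNet
    return lemmatizer.lemmatize(word)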
Also, since the PyPy interpreter supports nltk, check if using it can provide a performance boost.