
I have written the following function to preprocess some text data as input to a machine learning algorithm. It lowercases, tokenises, removes stop words and lemmatizes, returning a string of space-separated tokens. However, this code runs extremely slowly. What can I do to optimise it?

import os
import re
import csv
import time
import nltk
import string
import pickle
import numpy as np
import pandas as pd
import pyparsing as pp
import matplotlib.pyplot as plt
from sklearn import preprocessing
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
def preprocessText(text, lemmatizer, lemma, ps):
    '''
    Lowercases, tokenises, removes stop words and lemmatizes using WordNet. Returns a string of space-separated tokens.
    '''
    words = text.lower()
    words = re.sub("[^a-zA-Z]", " ", words)
    words = word_tokenize(words)
    stemmed_words = []
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    text = ""
    if lemmatizer == True:
        pos_translate = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
        meaningful_words = [lemma.lemmatize(w, pos=pos_translate[pos[0]] if pos[0] in pos_translate else 'n')
                            for w, pos in nltk.pos_tag(meaningful_words)]
        for each in meaningful_words:
            if len(each) > 1:
                text = text + " " + each
        return text
    else:
        words_again = []
        for each in meaningful_words:
            words_again.append(ps.stem(each))
        text = ""
        for each in words_again:
            if len(each) > 1:
                text = text + " " + each
        return text
asked Jan 31, 2017 at 15:12

2 Answers


Given that you are already using Python, I would highly recommend using Spacy (base text parsing & tagging) and Textacy (higher level text processing built on top of Spacy). It can do everything you want to do, and more, with one function call:

http://textacy.readthedocs.io/en/latest/api_reference.html#textacy.preprocess.preprocess_text

For your further travels in text based machine learning, there are also a wealth of additional features, particularly with Spacy 2.0 and its universe.
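As a rough illustration, here is a minimal sketch of the same steps done directly with spaCy (the model name en_core_web_sm and the length-1 filter are assumptions, not something from the question):

import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def spacy_preprocess(text):
    # One pass does tokenisation, tagging and lemmatisation; keep alphabetic,
    # non-stop-word tokens longer than one character.
    doc = nlp(text.lower())
    return " ".join(tok.lemma_ for tok in doc
                    if tok.is_alpha and not tok.is_stop and len(tok.text) > 1)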

answered Jun 13, 2018 at 1:56

You can cut down on the number of times you iterate over the words by filtering in a single loop, e.g. in the case when lemmatizer is falsy:

def preprocess_text_new(text, ps):
    '''
    Lowercases, tokenises, removes stop words and stems. Returns a string of space-separated tokens.
    '''
    words = re.sub(r"[^a-zA-Z]", " ", text.lower())
    words = word_tokenize(words)
    stops = set(stopwords.words("english"))
    result = []
    for word in words:
        if word in stops:
            continue
        stemmed = ps.stem(word)
        if len(stemmed) > 1:
            result.append(stemmed)
    return " ".join(result)

Note the use of the much faster str.join() instead of continuously concatenating a string.

If you are executing the function multiple times, you should not redo work you have already done. E.g. the stopwords set can be defined prior to the function call, and the regular expression can be pre-compiled.
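A minimal sketch of that idea, with the constants hoisted to module level (the names STOPS and NON_ALPHA are illustrative):

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPS = set(stopwords.words("english"))   # built once, not on every call
NON_ALPHA = re.compile(r"[^a-zA-Z]")      # compiled once, reused for every document

def preprocess_text_new(text, ps):
    words = word_tokenize(NON_ALPHA.sub(" ", text.lower()))
    result = []
    for word in words:
        if word in STOPS:
            continue
        stemmed = ps.stem(word)
        if len(stemmed) > 1:
            result.append(stemmed)
    return " ".join(result)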

From what I can conclude after profiling the code without lemmatizing, stemming is the largest contributor to the overall execution time; it is costly. If it is at all possible, you can optimize things by caching words that have already been stemmed:

result = []
cache = {}
for word in words:
    # ...
    if word not in cache:
        stemmed = ps.stem(word)
        cache[word] = stemmed
    else:
        stemmed = cache[word]
    result.append(stemmed)

Or, you can pre-compute stems for the most popular words in the corpus you are working in (difficult to tell how effective it would be, but please do experiment).

The same pre-computation and memoization idea would also work for the lemmatization part.
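For instance, a small sketch of that memoization using functools.lru_cache around the WordNet lemmatizer (the wrapper name is illustrative):

from functools import lru_cache
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()

@lru_cache(maxsize=None)
def cached_lemmatize(word, pos='n'):
    # Repeated (word, pos) pairs are served from the cache instead of
    # re-running the WordNet lookup.
    return lemma.lemmatize(word, pos=pos)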

Also, since the PyPy interpreter supports nltk, check if using it can provide a performance boost.

answered Jan 31, 2017 at 19:16