I have written the following function to preprocess some text data as input to a machine learning algorithm. It lowercases, tokenises, removes stop words and lemmatises, returning a string of space-separated tokens. However, this code runs extremely slowly. What can I do to optimise it?
import os
import re
import csv
import time
import nltk
import string
import pickle
import numpy as np
import pandas as pd
import pyparsing as pp
import matplotlib.pyplot as plt
from sklearn import preprocessing
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
def preprocessText(text, lemmatizer, lemma, ps):
    '''
    Lowercase, tokenise, remove stop words and lemmatise using WordNet. Returns a string of space-separated tokens.
    '''
    words = text.lower()
    words = re.sub("[^a-zA-Z]", " ", words)
    words = word_tokenize(words)
    stemmed_words = []
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    text = ""
    if lemmatizer == True:
        pos_translate = {'J':'a', 'V':'v', 'N':'n', 'R':'r'}
        meaningful_words = [lemma.lemmatize(w, pos=pos_translate[pos[0]] if pos[0] in pos_translate else 'n') for w, pos in nltk.pos_tag(meaningful_words)]
        for each in meaningful_words:
            if len(each) > 1:
                text = text + " " + each
        return text
    else:
        words_again = []
        for each in meaningful_words:
            words_again.append(ps.stem(each))
        text = ""
        for each in words_again:
            if len(each) > 1:
                text = text + " " + each
        return text
2 Answers
Given that you are already using Python, I would highly recommend using Spacy (base text parsing and tagging) and Textacy (higher-level text processing built on top of Spacy). Together they can do everything you want to do, and more, with one function call:
http://textacy.readthedocs.io/en/latest/api_reference.html#textacy.preprocess.preprocess_text
For your further travels in text-based machine learning, there is also a wealth of additional features, particularly with Spacy 2.0 and its universe.
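As a rough illustration, here is a minimal sketch of what the Spacy route might look like, assuming the small English model en_core_web_sm has been downloaded (the Textacy preprocessing call itself is documented at the link above):

    import spacy

    # Assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # keep only tagging/lemmatization

    def preprocess_spacy(text):
        """Lowercase, tokenize, drop stop words and non-alphabetic tokens, lemmatize."""
        doc = nlp(text.lower())
        return " ".join(
            token.lemma_ for token in doc
            if token.is_alpha and not token.is_stop and len(token.lemma_) > 1
        )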
You can cut down on the number of times you iterate over the words by filtering in a single loop, e.g. in the case when lemmatizer is "falsy":
def preprocess_text_new(text, ps):
    '''
    Lowercase, tokenise, remove stop words and stem. Returns a string of space-separated tokens.
    '''
    words = re.sub(r"[^a-zA-Z]", " ", text.lower())
    words = word_tokenize(words)
    stops = set(stopwords.words("english"))
    result = []
    for word in words:
        if word in stops:
            continue
        stemmed = ps.stem(word)
        if len(stemmed) > 1:
            result.append(stemmed)
    return " ".join(result)
Note the use of the much faster str.join() instead of continuously concatenating a string.
If you are executing the function multiple times, you should not redo work you have already done: e.g. the stopwords set can be defined prior to the function execution, and the regular expression can be pre-compiled.
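For instance, a sketch of the same stemming path with the one-off work hoisted out of the function (the names are illustrative):

    import re

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # One-off work, done once at module level rather than per call.
    NON_ALPHA = re.compile(r"[^a-zA-Z]")
    STOPS = frozenset(stopwords.words("english"))
    ps = PorterStemmer()

    def preprocess_text_new(text):
        words = word_tokenize(NON_ALPHA.sub(" ", text.lower()))
        result = []
        for word in words:
            if word in STOPS:
                continue
            stemmed = ps.stem(word)
            if len(stemmed) > 1:
                result.append(stemmed)
        return " ".join(result)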
From what I can conclude after profiling the code without lemmatizing, stemming is the largest contributor to the overall execution time; it is costly. If it is at all possible, you can optimize things by caching the words that were already stemmed:
result = []
cache = {}
for word in words:
    # ...
    if word not in cache:
        stemmed = ps.stem(word)
        cache[word] = stemmed
    else:
        stemmed = cache[word]
    result.append(stemmed)
Or, you can pre-compute stems for the most popular words in the corpus you are working with (it is difficult to tell how effective that would be, but please do experiment).
The same pre-computation and memoization idea would also work for the lemmatization part.
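A sketch of that memoization idea using functools.lru_cache, wrapping both the stemmer and the lemmatizer (the wrapper names are illustrative):

    from functools import lru_cache

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    ps = PorterStemmer()
    lemma = WordNetLemmatizer()

    @lru_cache(maxsize=None)
    def cached_stem(word):
        # Each distinct word is stemmed only once; repeats are dictionary look-ups.
        return ps.stem(word)

    @lru_cache(maxsize=None)
    def cached_lemmatize(word, pos='n'):
        # The (word, pos) pair forms the cache key.
        return lemma.lemmatize(word, pos=pos)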
Also, since the PyPy interpreter supports nltk, check if using it can provide a performance boost.