
I am studying data mining and data processing techniques. I'm doing this with data I've collected and stored in a CSV file. The problem is that this file is very large, having an astonishing 40 thousand lines of text.

Some of the algorithms in the processing pipeline are fast and agile, but the spelling correction of the words is laborious. I am using the floresta corpus from NLTK (from nltk.corpus import floresta). When it comes time to do this step, I daresay it will not finish in a timely manner.

Given this, I was wondering if someone can help me find a solution where I can read one line of the file, do the whole process, save it to the database, and then read the next line from the file. By reading line by line, and running the process on each line, I think I can improve the performance of the algorithm.

txtCorpus = []

dtype_dic = {'status_id': str, 'status_message': str, 'status_published': str}
for csvfile in pd.read_csv('data/MyCSV.csv', dtype=dtype_dic, encoding='utf-8',
                           sep=',', header='infer', engine='c', chunksize=2):
    txtCorpus.append(csvfile)

def status_processing(txtCorpus):
    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = str(txtCorpus)

    print "Doing the Initial Process..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"

    print "Starting Lexical Diversity..."
    myCorpus.lexical_diversity()
    print "Done"
    print "----------------------------"

    print "Removing Stopwords..."
    myCorpus.stopwords()
    print "Done"
    print "----------------------------"

    print "Lemmatization..."
    myCorpus.lemmatization()
    print "Done"
    print "----------------------------"

    print "Correcting the words..."
    myCorpus.spell_correct()
    print "Done"
    print "----------------------------"

    print "Untokenizing..."
    word_final = myCorpus.untokenizing()
    print "Done"
    print "----------------------------"

    print "Saving in DB...."
    try:
        db.myDB.insert(word_final, continue_on_error=True)
    except pymongo.errors.DuplicateKeyError:
        pass

    print "Insertion in the DB Completed. End of the Pre-Processing Process"

def main():
    status_processing(txtCorpus)

main()

I believe that by looking at the code, you can better understand what I explained above. I thought about doing a for loop where I read one line, pass it to def status_processing(txtCorpus):, and repeat the process until the end. But I could not reach a solution.
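
Roughly, this is the shape I was aiming for (a sketch only: it assumes status_processing were changed to return the processed text instead of saving it, and save_to_db is a hypothetical helper wrapping my db.myDB.insert call):

dtype_dic = {'status_id': str, 'status_message': str, 'status_published': str}

# chunksize=1 makes read_csv yield one-row DataFrames, so each status would be
# processed and saved before the next one is read from disk.
for chunk in pd.read_csv('data/MyCSV.csv', dtype=dtype_dic, encoding='utf-8',
                         sep=',', header='infer', engine='c', chunksize=1):
    word_final = status_processing(chunk)  # process a single line
    save_to_db(word_final)                 # hypothetical: insert before reading on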

preprocessing file:

import nltk, re, htmlentitydefs
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

import spellcorrect


class Techniques(object):
    Lemmatizing = 1
    Stopwords = 2
    Stemming = 3
    Spellcorrect = 4

    def __init__(self, Type):
        self.value = Type

    def __str__(self):
        if self.value == Techniques.Lemmatizing:
            return 'Lemmatizing'
        if self.value == Techniques.Stopwords:
            return 'Stopwords'
        if self.value == Techniques.Stemming:
            return 'Stemming'
        if self.value == Techniques.Spellcorrect:
            return 'Spell Correct'

    def __eq__(self, y):
        return self.value == y.value


class PreProcessing():

    @property
    def text(self):
        return self.__text

    @text.setter
    def text(self, text):
        self.__text = text

    tokens = None

    def initial_processing(self):
        soup = BeautifulSoup(self.text, "html.parser")
        self.text = soup.get_text()
        # TODO: change here if you want to keep the links
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", self.text)
        self.tokens = self.tokenizing(1, self.text)
        pass

    def lexical_diversity(self):
        word_count = len(self.text)
        vocab_size = len(set(self.text))
        return vocab_size / word_count

    def tokenizing(self, type, text):
        if (type == 1):
            return nltk.tokenize.word_tokenize(text)
        elif (type == 2):
            stok = nltk.data.load('tokenizers/punkt/portuguese.pickle')
            #stok = nltk.PunktSentenceTokenizer(train)
            return stok.tokenize(text)

    def stopwords(self):
        stopwords = nltk.corpus.stopwords.words('portuguese')
        stopWords = set(stopwords)
        palavroesPortugues = ['foda', 'caralho', 'porra', 'puta', 'merda',
                              'cu', 'foder', 'viado', 'cacete']
        stopWords.update(palavroesPortugues)
        filteredWords = []
        for word in self.tokens:
            if word not in stopWords:
                filteredWords.append(word)
        self.tokens = filteredWords

    def stemming(self):
        snowball = SnowballStemmer('portuguese')
        stemmedWords = []
        for word in self.tokens:
            stemmedWords.append(snowball.stem(word))
        self.tokens = stemmedWords

    def lemmatization(self):
        lemmatizer = WordNetLemmatizer()  # 'portuguese'
        lemmatizedWords = []
        for word in self.tokens:
            lemmatizedWords.append(lemmatizer.lemmatize(word, pos='v'))
        self.tokens = lemmatizedWords

    def part_of_speech_tagging(self):
        return 'Not implemented yet'

    def padronizacaoInternetes(self):
        return 'Not implemented yet'

    def untokenize(self, words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
            "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    def untokenizing(self):
        return ' '.join(self.tokens)
        #return self.untokenize(self.tokens)
        #return tokenize.untokenize(self.tokens)

    def spell_correct(self):
        correctedWords = []
        spell = spellcorrect.SpellCorrect()
        for word in self.tokens:
            correctedWords.append(spell.correct(word))
        self.tokens = correctedWords

The spellcorrect file:

import re, collections
from nltk.corpus import floresta


class SpellCorrect:

    def words(self, text): return re.findall('[a-z]+', text.lower())

    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    NWORDS = train(floresta.words())  #words(file('big.txt').read())

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(self, word):
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits for c in self.alphabet if b]
        inserts = [a + c + b for a, b in splits for c in self.alphabet]
        return set(deletes + transposes + replaces + inserts)

    def known_edits2(self, word):
        return set(e2 for e1 in self.edits1(word) for e2 in self.edits1(e1)
                   if e2 in self.NWORDS)

    def known(self, words): return set(w for w in words if w in self.NWORDS)

    def correct(self, word):
        candidates = (self.known([word]) or self.known(self.edits1(word))
                      or self.known_edits2(word) or [word])
        return max(candidates, key=self.NWORDS.get)
asked Nov 11, 2016 at 21:05
  • I assume that it's status_processing() that is taking a long time, rather than reading the CSV file? If so, I'm not sure we can help you, considering that you haven't shown us the code behind myCorpus. Commented Nov 11, 2016 at 21:14
  • There's too much code missing; I'm afraid we can't review it in its current form. Please take a look at the help center. Commented Nov 11, 2016 at 22:25
  • @200_success Okay, I'll edit the question and show the other script. It may get a little long. Regarding status_processing(): the first operations run in an acceptable time. The problem is when I invoke spell_correct(), because only after I have read the entire CSV file does it get to correct everything. Commented Nov 12, 2016 at 0:15

1 Answer


spellcorrect.py

  • SpellCorrect should not be a class. Your two "working" methods (train and edits1) do not reference self at all, and the other ones only use self as a namespace. You should provide plain functions instead.
  • As far as I can tell, the words method is not used anymore, since you commented it out in the building of NWORDS.
  • alphabet is better imported from string: from string import ascii_lowercase as alphabet.
  • I don't understand the definition of model in train. Why give a score of 1 to missing features, and thus a score of 2 to features encountered once? Moreover, if the aim of train is to count how many times a given feature appears in features, you'd be better off using a collections.Counter.
  • You will get a better memory footprint if you turn edits1 into a generator: just yield (and yield from in Python 3) the computed elements instead of storing them in a list.
  • Turning edits1 into a generator also allows you to do the same with edits2 without requiring it to filter elements itself; leave that job to known alone, avoiding the discrepancy in how you build your words.
  • In edits1, you can iterate more easily over the word and still get the index by using enumerate. It can simplify some of your checks.
import collections
from string import ascii_lowercase as alphabet

from nltk.corpus import floresta


NWORDS = collections.Counter(floresta.words())


def edits1(word):
    for i, letter in enumerate(word):
        begin, end = word[:i], word[i+1:]
        yield begin + end  # delete
        if end:
            yield begin + end[0] + letter + end[1:]  # transpose
        else:
            for other in alphabet:
                yield begin + letter + other  # insert at the end
        for other in alphabet:
            yield begin + other + end  # replace
            yield begin + other + letter + end  # insert before the current letter


def edits2(word):
    for edited_once in edits1(word):
        for edited_twice in edits1(edited_once):
            yield edited_twice


def known(words):
    return set(w for w in words if w in NWORDS)


def correct(word):
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=NWORDS.get)
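
Usage then becomes a plain function call, with no class instantiation (this assumes the floresta corpus has been downloaded once with nltk.download('floresta')):

from spellcorrect import correct
print correct('palavra')  # returns the best-scored known candidate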

Main file

  • Your top-level code should be under an if __name__ == '__main__': guard. So move your txtCorpus building and your call to main there.
  • In fact, main would be of more interest if it built txtCorpus itself before calling status_processing.
  • status_processing also does more than it advertises: it processes the statuses, but it also saves the result to the DB. You should let the caller do whatever they want with the processed results.
  • All these prints may be unwanted for someone else. Consider using the logging module instead (a sketch follows the code below).
def status_processing(corpus):
    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = str(corpus)

    print "Doing the Initial Process..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"

    print "Starting Lexical Diversity..."
    myCorpus.lexical_diversity()
    print "Done"
    print "----------------------------"

    print "Removing Stopwords..."
    myCorpus.stopwords()
    print "Done"
    print "----------------------------"

    print "Lemmatization..."
    myCorpus.lemmatization()
    print "Done"
    print "----------------------------"

    print "Correcting the words..."
    myCorpus.spell_correct()
    print "Done"
    print "----------------------------"

    print "Untokenizing..."
    word_final = myCorpus.untokenizing()
    print "Done"
    print "----------------------------"

    return word_final


if __name__ == '__main__':
    dtype_dic = {'status_id': str, 'status_message': str, 'status_published': str}
    txt_corpus = list(pd.read_csv(
        'data/MyCSV.csv', dtype=dtype_dic,
        encoding='utf-8', sep=',',
        header='infer', engine='c', chunksize=2))

    word_final = status_processing(txt_corpus)

    print "Saving in DB...."
    try:
        db.myDB.insert(word_final, continue_on_error=True)
    except pymongo.errors.DuplicateKeyError:
        pass
    print "Insertion in the DB Completed. End of the Pre-Processing Process"

preprocessing.py

  • Techniques should be an enum. You can use flufl.enum if you need one in Python 2 (a sketch follows the code below). But since I don't see it used anywhere in the code, you can simply get rid of that class.
  • Since the code seems to be Python 2, you should have PreProcessing inherit from object.
  • The text property of PreProcessing does not add value over a plain self.text attribute initialized in the constructor, especially since you need to set it for the other methods to work.
  • pass is unnecessary for non-empty blocks.
  • tokenizing offers a choice between two variants; a boolean parameter would be better suited here. And since you seem to use only one of them, you can give it a default value.
  • I would merge __init__ and initial_processing, since this method populates the self.tokens attribute with the initial set of tokens that every other method works with.
  • Using raise NotImplementedError instead of return 'Not implemented yet' is much more meaningful.
  • Consider using list-comprehensions or the list constructor instead of manually appending items into an empty list.
import re

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

import spellcorrect


class PreProcessing(object):
    def __init__(self, text):
        soup = BeautifulSoup(text, "html.parser")
        # TODO: change here if you want to keep the links
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", soup.get_text())
        self.tokens = self.tokenizing()

    def lexical_diversity(self):
        word_count = len(self.text)
        vocab_size = len(set(self.text))
        return vocab_size / word_count

    def tokenizing(self, use_default_tokenizer=True):
        if use_default_tokenizer:
            return nltk.tokenize.word_tokenize(self.text)
        stok = nltk.data.load('tokenizers/punkt/portuguese.pickle')
        return stok.tokenize(self.text)

    def stopwords(self):
        stopwords = set(nltk.corpus.stopwords.words('portuguese'))
        stopwords.update([
            'foda', 'caralho', 'porra',
            'puta', 'merda', 'cu',
            'foder', 'viado', 'cacete'])
        self.tokens = [word for word in self.tokens if word not in stopwords]

    def stemming(self):
        snowball = SnowballStemmer('portuguese')
        self.tokens = [snowball.stem(word) for word in self.tokens]

    def lemmatization(self):
        lemmatizer = WordNetLemmatizer()  # 'portuguese'
        self.tokens = [lemmatizer.lemmatize(word, pos='v') for word in self.tokens]

    def part_of_speech_tagging(self):
        raise NotImplementedError

    def padronizacaoInternetes(self):
        raise NotImplementedError

    def untokenize(self, words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
            "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    def untokenizing(self):
        return ' '.join(self.tokens)

    def spell_correct(self):
        self.tokens = [spellcorrect.correct(word) for word in self.tokens]
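
If you did want to keep Techniques, a minimal sketch with flufl.enum could look like this (member names taken from your class; the rest of the code is not adapted to use it):

from flufl.enum import Enum


class Techniques(Enum):
    lemmatizing = 1
    stopwords = 2
    stemming = 3
    spellcorrect = 4

Members then compare by identity and print as, e.g., Techniques.stemming, so the hand-written __str__ and __eq__ are no longer needed.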

More generic comments

Consider reading (and following) PEP 8, the official Python style guide, especially as regards:

  • import declarations;
  • whitespace around operators, commas...;
  • and variable names.

Also consider using docstrings throughout your code; it will make it much easier to understand.
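
For instance, a docstring on known in spellcorrect.py might read (the wording is only an example):

def known(words):
    """Return the subset of `words` that appears in the NWORDS counter."""
    return set(w for w in words if w in NWORDS)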

answered Nov 12, 2016 at 12:06
  • One thing I noticed later: I used pandas instead of the csv module. The reason: I do not know why, but the csv module was not reading my file as UTF-8; many of the lines came out in the form \u0159. With pandas I managed to read the file as UTF-8, but while trying to run the insert, Mongo was not accepting the text. I may need to go back to the csv module and find a way to handle the encoding. Commented Nov 12, 2016 at 15:13
  • Trying to implement your notes: in preprocessing.py, if I merge __init__ and initial_processing, I have no way to pass a parameter in status_processing. Per your note: "In fact, main would be of better interest if it built txtCorpus itself before calling status_processing". Commented Nov 12, 2016 at 15:55
  • @LeandroSantos I have a hard time figuring out what you really mean. If you need an additional parameter to status_processing, you can just add it. If this parameter needs to be passed to what was before initial_processing, just add it to __init__. The thing is, almost every method in PreProcessing relies on self.tokens being populated with an iterable, so initializing it to None is a bad thing. Thus you might as well do your initialization in __init__. Commented Nov 12, 2016 at 16:40
  • Hmm, look: by your logic, passing a parameter as in word_final = status_processing(txt_corpus), the output was TypeError: status_processing() takes no arguments (1 given) on the line word_final = status_processing(txt_corpus). Apparently the parameter needs to be declared in the method signature; that's what I meant in the previous comment. And if I declare it as def status_processing(txt_corpus):, the output is TypeError: __init__() takes exactly 2 arguments (1 given). Commented Nov 12, 2016 at 18:08
  • @LeandroSantos status_processing is already defined as def status_processing(corpus):. As for the __init__ error: since I modified PreProcessing afterwards, you'll need to adapt and change myCorpus = preprocessing.PreProcessing() into myCorpus = preprocessing.PreProcessing(corpus) (obviously removing the next two lines in doing so). Commented Nov 12, 2016 at 18:32
