# spellcorrect.py
- `SpellCorrect` should not be a class. Your two "working" methods (`train` and `edit1`) do not reference `self` at all, and the other ones only use `self` for its namespace. You should provide functions instead.
- As far as I can tell, the `words` method is not used anymore since you commented it out in the building of `NWORDS`.
- `alphabet` is better imported from `string`: `from string import ascii_lowercase as alphabet`.
- I don't understand the definition of `model` in `train`. Why give a score of `1` to missing features, and so a score of `2` for features encountered once? Moreover, if the aim of `train` is to count how many times a given feature appears in `features`, you'd be better off using a `collections.Counter`.
- You will have a better memory footprint if you turn `edit1` into a generator. Just `yield` (and `yield from` in Python 3) computed elements instead of storing them in a list.
- Turning `edit1` into a generator will allow you to do the same with `edit2` without requiring it to filter elements itself, leaving that job to `known` alone. This avoids the discrepancy between how you build your words.
- In `edit1`, you can iterate more easily over `words` and still get the index using `enumerate`. It can simplify some of your checks.
```python
import collections
from string import ascii_lowercase as alphabet

from nltk.corpus import floresta

NWORDS = collections.Counter(floresta.words())


def edits1(word):
    for i, letter in enumerate(word):
        begin, end = word[:i], word[i+1:]
        yield begin + end                            # delete
        if end:
            yield begin + end[0] + letter + end[1:]  # transpose
        else:
            for other in alphabet:
                yield begin + letter + other         # insert at the end
        for other in alphabet:
            yield begin + other + end                # replace
            yield begin + other + letter + end       # insert before the current letter


def edits2(word):
    for edited_once in edits1(word):
        for edited_twice in edits1(edited_once):
            yield edited_twice


def known(words):
    return set(w for w in words if w in NWORDS)


def correct(word):
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=NWORDS.get)
```
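In Python 3, the `yield from` mentioned above lets `edits2` delegate to the inner generator without an explicit nested loop. A minimal sketch with a toy `edits1` (deletions only, so it runs standalone without the NLTK corpus):

```python
def edits1(word):
    """Toy stand-in for the real edits1: yield each single-character deletion."""
    for i in range(len(word)):
        yield word[:i] + word[i+1:]


def edits2(word):
    # Delegate to the inner generator instead of nesting two for-loops.
    for edited_once in edits1(word):
        yield from edits1(edited_once)
```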
# Main file
- Your top-level code should be under an `if __name__ == '__main__':` clause. So move your `txtCorpus` building and your call to `main` there.
- In fact, `main` would be of better interest if it built `txtCorpus` itself before calling `status_processing`.
- `status_processing` also does more than it advertises: it processes the statuses, but it also saves the result in the DB. You should let the caller do whatever they want with the processed results.
- All these `print`s can be unneeded to someone else. Consider using the `logging` module instead.
```python
import pandas as pd
import pymongo

import preprocessing


def status_processing(corpus):
    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = str(corpus)

    print "Doing the Initial Process..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"

    print "Starting Lexical Diversity..."
    myCorpus.lexical_diversity()
    print "Done"
    print "----------------------------"

    print "Removing Stopwords..."
    myCorpus.stopwords()
    print "Done"
    print "----------------------------"

    print "Lemmatization..."
    myCorpus.lemmatization()
    print "Done"
    print "----------------------------"

    print "Correcting the words..."
    myCorpus.spell_correct()
    print "Done"
    print "----------------------------"

    print "Untokenizing..."
    word_final = myCorpus.untokenizing()
    print "Done"
    print "----------------------------"

    return word_final


if __name__ == '__main__':
    dtype_dic = {'status_id': str, 'status_message': str, 'status_published': str}
    txt_corpus = list(pd.read_csv(
        'data/MyCSV.csv', dtype=dtype_dic,
        encoding='utf-8', sep=',',
        header='infer', engine='c', chunksize=2))

    word_final = status_processing(txt_corpus)

    print "Saving in DB...."
    try:
        # `db` is your MongoDB handle, defined elsewhere in your code
        db.myDB.insert(word_final, continue_on_error=True)
    except pymongo.errors.DuplicateKeyError:
        pass
    print "Insertion in the DB Completed. End of the Pre-Processing Process"
```
# preprocessing.py
- `Techniques` should be an `enum`. You can use `flufl.enum` if you need them in Python 2. But I don't see them used anywhere in the code, so you can get rid of that class.
- Since it seems that the code is for Python 2, you should have `PreProcessing` inherit from `object`.
- The `text` property of `PreProcessing` does not add value over a `self.text` attribute initialized in the constructor, especially since you need to set it for the other methods to work.
- `pass` is unnecessary for non-empty blocks.
- `tokenizing` offers a choice between two variants; a boolean parameter would be more suited here. And since you seem to use only one of them, you can give it a default value.
- I would merge `__init__` and `initial_processing`, since this method populates the `self.tokens` attribute with the initial set of tokens every other method works with.
- Using `raise NotImplementedError` instead of `return 'Not implemented yet'` is much more meaningful.
- Consider using list comprehensions or the `list` constructor instead of manually `append`ing items into an empty list.
```python
import re

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

import spellcorrect


class PreProcessing(object):
    def __init__(self, text):
        soup = BeautifulSoup(text, "html.parser")
        # TODO: change here if you want to keep the links
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", soup.get_text())
        self.tokens = self.tokenizing()

    def lexical_diversity(self):
        word_count = len(self.text)
        vocab_size = len(set(self.text))
        return vocab_size / word_count

    def tokenizing(self, use_default_tokenizer=True):
        if use_default_tokenizer:
            return nltk.tokenize.word_tokenize(self.text)
        stok = nltk.data.load('tokenizers/punkt/portuguese.pickle')
        return stok.tokenize(self.text)

    def stopwords(self):
        stopwords = set(nltk.corpus.stopwords.words('portuguese'))
        stopwords.update([
            'foda', 'caralho', 'porra',
            'puta', 'merda', 'cu',
            'foder', 'viado', 'cacete'])
        self.tokens = [word for word in self.tokens if word not in stopwords]

    def stemming(self):
        snowball = SnowballStemmer('portuguese')
        self.tokens = [snowball.stem(word) for word in self.tokens]

    def lemmatization(self):
        lemmatizer = WordNetLemmatizer()  # 'portuguese'
        self.tokens = [lemmatizer.lemmatize(word, pos='v') for word in self.tokens]

    def part_of_speech_tagging(self):
        raise NotImplementedError

    def padronizacaoInternetes(self):
        raise NotImplementedError

    def untokenize(self, words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
            "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    def untokenizing(self):
        return ' '.join(self.tokens)

    def spell_correct(self):
        self.tokens = [spellcorrect.correct(word) for word in self.tokens]
```
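If `Techniques` does turn out to be needed after all, a sketch with the stdlib `enum` module (Python 3; `flufl.enum` provides the equivalent on Python 2). The member names below are illustrative, not taken from your code:

```python
from enum import Enum


class Techniques(Enum):
    STEMMING = 1
    LEMMATIZATION = 2
    SPELL_CORRECT = 3


# Members are singletons, so identity comparison is safe and typo-proof
# compared to bare strings or magic integers.
chosen = Techniques.LEMMATIZATION
```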
# More generic comments
Consider reading (and following) PEP 8, the official Python style guide, especially with regard to:

- `import` declarations;
- whitespace around operators, commas, etc.;
- and variable names.

Also consider using docstrings all around your code; it will make it much easier to understand.
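For instance, a docstring on a small helper (an illustrative function, not taken from your code) can state both the contract and the expected range of results:

```python
def lexical_diversity(tokens):
    """Return the ratio of unique tokens to total tokens.

    The result is between 0.0 and 1.0: values near 1.0 mean almost
    no repetition, values near 0.0 mean heavy repetition.
    """
    return len(set(tokens)) / float(len(tokens))
```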