# spellcorrect.py
- `SpellCorrect` should not be a class. Your two "working" methods (`train` and `edit1`) do not reference `self` at all, and the other ones only use `self` for its namespace. You should provide functions instead.
- As far as I can tell, the `words` method is not used anymore since you commented it out in the building of `NWORDS`.
- `alphabet` is better imported from `string`: `from string import ascii_lowercase as alphabet`.
- I don't understand the definition of `model` in `train`. Why give a score of `1` to missing features, and so a score of `2` for features encountered once? Moreover, if the aim of `train` is to count how many times a given feature appears in `features`, you'd be better off using a `collections.Counter`.
- You will have a better memory footprint if you turn `edit1` into a generator. Just `yield` (and `yield from` in Python 3) computed elements instead of storing them in a list.
- Turning `edit1` into a generator will allow you to do the same with `edit2` without requiring it to filter elements itself, leaving that job to `known` alone. This avoids the discrepancy between how you build your words.
- In `edit1`, you can iterate more easily over `words` and still get the index using `enumerate`. It can simplify some of your checks.
```python
import collections
from string import ascii_lowercase as alphabet

from nltk.corpus import floresta

NWORDS = collections.Counter(floresta.words())


def edits1(word):
    for i, letter in enumerate(word):
        begin, end = word[:i], word[i+1:]
        yield begin + end                            # delete
        if end:
            yield begin + end[0] + letter + end[1:]  # transpose
        else:
            for other in alphabet:
                yield begin + letter + other         # insert at the end
        for other in alphabet:
            yield begin + other + end                # replace
            yield begin + other + letter + end       # insert before the current letter


def edits2(word):
    for edited_once in edits1(word):
        for edited_twice in edits1(edited_once):
            yield edited_twice


def known(words):
    return set(w for w in words if w in NWORDS)


def correct(word):
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=NWORDS.get)
```
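In Python 3, the `yield from` mentioned above lets `edits2` delegate to the inner generator without an explicit nested loop. A minimal sketch with a toy `edits1` (deletions only, so it runs standalone without the NLTK corpus):

```python
def edits1(word):
    """Toy stand-in for the real edits1: yield each single-character deletion."""
    for i in range(len(word)):
        yield word[:i] + word[i+1:]


def edits2(word):
    # Delegate to the inner generator instead of nesting two for-loops.
    for edited_once in edits1(word):
        yield from edits1(edited_once)
```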
# Main file
- Your top-level code should be under an `if __name__ == '__main__':` clause. So move your `txtCorpus` building and your call to `main` there.
- In fact, `main` would be of better interest if it built `txtCorpus` itself before calling `status_processing`.
- `status_processing` also does more than it advertises: it processes the statuses, but it also saves the result in the DB. You should let the caller do whatever they want with the processed results.
- All these `print`s can be unneeded to someone else. Consider using the `logging` module instead.
```python
import pandas as pd
import pymongo

import preprocessing


def status_processing(corpus):
    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = str(corpus)

    print "Doing the Initial Process..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"

    print "Starting Lexical Diversity..."
    myCorpus.lexical_diversity()
    print "Done"
    print "----------------------------"

    print "Removing Stopwords..."
    myCorpus.stopwords()
    print "Done"
    print "----------------------------"

    print "Lemmatization..."
    myCorpus.lemmatization()
    print "Done"
    print "----------------------------"

    print "Correcting the words..."
    myCorpus.spell_correct()
    print "Done"
    print "----------------------------"

    print "Untokenizing..."
    word_final = myCorpus.untokenizing()
    print "Done"
    print "----------------------------"

    return word_final


if __name__ == '__main__':
    dtype_dic = {'status_id': str, 'status_message': str, 'status_published': str}
    txt_corpus = list(pd.read_csv(
        'data/MyCSV.csv', dtype=dtype_dic,
        encoding='utf-8', sep=',',
        header='infer', engine='c', chunksize=2))

    word_final = status_processing(txt_corpus)

    print "Saving in DB...."
    try:
        # `db` is your MongoDB handle, defined elsewhere in your code
        db.myDB.insert(word_final, continue_on_error=True)
    except pymongo.errors.DuplicateKeyError:
        pass
    print "Insertion in the DB Completed. End of the Pre-Processing Process"
```
# preprocessing.py
- `Techniques` should be an `enum`. You can use `flufl.enum` if you need them in Python 2. But I don't see them used anywhere in the code, so you can get rid of that class.
- Since it seems that the code is for Python 2, you should have `PreProcessing` inherit from `object`.
- The `text` property of `PreProcessing` does not add value over a `self.text` attribute initialized in the constructor, especially since you need to set it for the other methods to work.
- `pass` is unnecessary for non-empty blocks.
- `tokenizing` offers a choice between two variants; a boolean parameter would be more suited here. And since you seem to use only one of them, you can give it a default value.
- I would merge `__init__` and `initial_processing`, since this method populates the `self.tokens` attribute with the initial set of tokens every other method works with.
- Using `raise NotImplementedError` instead of `return 'Not implemented yet'` is much more meaningful.
- Consider using list comprehensions or the `list` constructor instead of manually `append`ing items into an empty list.
```python
import re

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

import spellcorrect


class PreProcessing(object):
    def __init__(self, text):
        soup = BeautifulSoup(text, "html.parser")
        # TODO: change here if you want to keep the links
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", soup.get_text())
        self.tokens = self.tokenizing()

    def lexical_diversity(self):
        word_count = len(self.text)
        vocab_size = len(set(self.text))
        return vocab_size / word_count

    def tokenizing(self, use_default_tokenizer=True):
        if use_default_tokenizer:
            return nltk.tokenize.word_tokenize(self.text)
        stok = nltk.data.load('tokenizers/punkt/portuguese.pickle')
        return stok.tokenize(self.text)

    def stopwords(self):
        stopwords = set(nltk.corpus.stopwords.words('portuguese'))
        stopwords.update([
            'foda', 'caralho', 'porra',
            'puta', 'merda', 'cu',
            'foder', 'viado', 'cacete'])
        self.tokens = [word for word in self.tokens if word not in stopwords]

    def stemming(self):
        snowball = SnowballStemmer('portuguese')
        self.tokens = [snowball.stem(word) for word in self.tokens]

    def lemmatization(self):
        lemmatizer = WordNetLemmatizer()  # 'portuguese'
        self.tokens = [lemmatizer.lemmatize(word, pos='v') for word in self.tokens]

    def part_of_speech_tagging(self):
        raise NotImplementedError

    def padronizacaoInternetes(self):
        raise NotImplementedError

    def untokenize(self, words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
            "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    def untokenizing(self):
        return ' '.join(self.tokens)

    def spell_correct(self):
        self.tokens = [spellcorrect.correct(word) for word in self.tokens]
```
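If `Techniques` does turn out to be needed after all, a sketch with the stdlib `enum` module (Python 3; `flufl.enum` provides the equivalent on Python 2). The member names below are illustrative, not taken from your code:

```python
from enum import Enum


class Techniques(Enum):
    STEMMING = 1
    LEMMATIZATION = 2
    SPELL_CORRECT = 3


# Members are singletons, so identity comparison is safe and typo-proof
# compared to bare strings or magic integers.
chosen = Techniques.LEMMATIZATION
```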
# More generic comments
Consider reading (and following) PEP 8, the official Python style guide, especially with regard to:

- `import` declarations;
- whitespace around operators, commas, etc.;
- and variable names.

Also consider using docstrings all around your code; it will make it much easier to understand.
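For instance, a docstring on a small helper (an illustrative function, not taken from your code) can state both the contract and the expected range of results:

```python
def lexical_diversity(tokens):
    """Return the ratio of unique tokens to total tokens.

    The result is between 0.0 and 1.0: values near 1.0 mean almost
    no repetition, values near 0.0 mean heavy repetition.
    """
    return len(set(tokens)) / float(len(tokens))
```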