I am studying data mining and data processing techniques, using data I collected and stored in a CSV file. The problem is that this file is very large, around 40 thousand lines of text.
Some of the processing steps are fast, but the spelling correction of the words is laborious. I am using the NLTK corpus `floresta` (`from nltk.corpus import floresta`), and when it comes time to do this step, I dare say it will not finish in a timely manner.
So I was wondering if someone can help me with a solution where I read one line of the file, run the whole process on it, save it to the database, and then read the next line. By processing the file line by line like this, I think I can improve the performance of the algorithm.
import pandas as pd
import pymongo

import preprocessing

# `db` is an existing MongoDB database handle (setup not shown).

txtCorpus = []
dtype_dic= {'status_id': str, 'status_message' : str, 'status_published':str}
for csvfile in pd.read_csv('data/MyCSV.csv',dtype=dtype_dic,encoding='utf-8',sep=',', header='infer',engine='c', chunksize=2):
    txtCorpus.append(csvfile)
def status_processing(txtCorpus):
    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = str(txtCorpus)

    print "Doing the Initial Process..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"

    print "Starting Lexical Diversity..."
    myCorpus.lexical_diversity()
    print "Done"
    print "----------------------------"

    print "Removing Stopwords..."
    myCorpus.stopwords()
    print "Done"
    print "----------------------------"

    print "Lemmatization..."
    myCorpus.lemmatization()
    print "Done"
    print "----------------------------"

    print "Correcting the words..."
    myCorpus.spell_correct()
    print "Done"
    print "----------------------------"

    print "Untokenizing..."
    word_final = myCorpus.untokenizing()
    print "Done"
    print "----------------------------"

    print "Saving in DB...."
    try:
        db.myDB.insert(word_final, continue_on_error=True)
    except pymongo.errors.DuplicateKeyError:
        pass
    print "Insertion in the DB Completed. End of the Pre-Processing Process"


def main():
    status_processing(txtCorpus)


main()
I believe that by looking at the code you can better understand what I explained above. I thought about doing a `for` loop where I read a line, pass it to `def status_processing(txtCorpus):`, and repeat the process until the end of the file, but I could not reach a working solution.
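Roughly, the kind of loop I have in mind is sketched below. Here `process_status` is just a placeholder for the whole pre-processing pipeline, and the database insert is only indicated as a comment:

import pandas as pd

dtype_dic = {'status_id': str, 'status_message': str, 'status_published': str}

def process_status(message):
    # Placeholder for the real pipeline (initial processing, stopwords,
    # lemmatization, spell correction, untokenizing...).
    return message

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole 40k-line file at once.
reader = pd.read_csv('data/MyCSV.csv', dtype=dtype_dic, encoding='utf-8',
                     sep=',', engine='c', chunksize=1000)
for chunk in reader:
    for message in chunk['status_message']:
        word_final = process_status(message)
        # db.myDB.insert(...)  # save each processed row before reading the next chunk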
preprocessing file:
import nltk,re, htmlentitydefs
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
import spellcorrect


class Techniques(object):
    Lemmatizing = 1
    Stopwords = 2
    Stemming = 3
    Spellcorrect = 4

    def __init__(self, Type):
        self.value = Type

    def __str__(self):
        if self.value == Techniques.Lemmatizing:
            return 'Lemmatizing'
        if self.value == Techniques.Stopwords:
            return 'Stopwords'
        if self.value == Techniques.Stemming:
            return 'Stemming'
        if self.value == Techniques.Spellcorrect:
            return 'Spell Correct'

    def __eq__(self, y):
        return self.value == y.value


class PreProcessing():

    @property
    def text(self):
        return self.__text

    @text.setter
    def text(self, text):
        self.__text = text

    tokens = None

    def initial_processing(self):
        soup = BeautifulSoup(self.text, "html.parser")
        self.text = soup.get_text()
        # TODO: if you want to keep the links, change here
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", self.text)
        self.tokens = self.tokenizing(1, self.text)
        pass

    def lexical_diversity(self):
        word_count = len(self.text)
        vocab_size = len(set(self.text))
        return vocab_size / word_count

    def tokenizing(self, type, text):
        if (type == 1):
            return nltk.tokenize.word_tokenize(text)
        elif (type == 2):
            stok = nltk.data.load('tokenizers/punkt/portuguese.pickle')
            #stok = nltk.PunktSentenceTokenizer(train)
            return stok.tokenize(text)

    def stopwords(self):
        stopwords = nltk.corpus.stopwords.words('portuguese')
        stopWords = set(stopwords)
        palavroesPortugues = ['foda','caralho', 'porra', 'puta', 'merda', 'cu', 'foder', 'viado', 'cacete']
        stopWords.update(palavroesPortugues)
        filteredWords = []
        for word in self.tokens:
            if word not in stopWords:
                filteredWords.append(word)
        self.tokens = filteredWords

    def stemming(self):
        snowball = SnowballStemmer('portuguese')
        stemmedWords = []
        for word in self.tokens:
            stemmedWords.append(snowball.stem(word))
        self.tokens = stemmedWords

    def lemmatization(self):
        lemmatizer = WordNetLemmatizer()  #'portuguese'
        lemmatizedWords = []
        for word in self.tokens:
            lemmatizedWords.append(lemmatizer.lemmatize(word, pos='v'))
        self.tokens = lemmatizedWords

    def part_of_speech_tagging(self):
        return 'Not implemented yet'

    def padronizacaoInternetes(self):
        return 'Not implemented yet'

    def untokenize(self, words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
            "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    def untokenizing(self):
        return ' '.join(self.tokens)
        #return self.untokenize(self.tokens)
        #return tokenize.untokenize(self.tokens)

    def spell_correct(self):
        correctedWords = []
        spell = spellcorrect.SpellCorrect()
        for word in self.tokens:
            correctedWords.append(spell.correct(word))
        self.tokens = correctedWords
spellcorrect file:
import re, collections
from nltk.corpus import floresta


class SpellCorrect:

    def words(self, text): return re.findall('[a-z]+', text.lower())

    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    NWORDS = train(floresta.words()) #words(file('big.txt').read())

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(self, word):
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
        replaces = [a + c + b[1:] for a, b in splits for c in self.alphabet if b]
        inserts = [a + c + b for a, b in splits for c in self.alphabet]
        return set(deletes + transposes + replaces + inserts)

    def known_edits2(self, word):
        return set(e2 for e1 in self.edits1(word) for e2 in self.edits1(e1) if e2 in self.NWORDS)

    def known(self, words): return set(w for w in words if w in self.NWORDS)

    def correct(self, word):
        candidates = self.known([word]) or self.known(self.edits1(word)) or self.known_edits2(word) or [word]
        return max(candidates, key=self.NWORDS.get)
1 Answer
spellcorrect.py

- `SpellCorrect` should not be a class. Your two "working" methods (`train` and `edits1`) do not reference `self` at all, and the other ones only use `self` for its namespace. You should provide functions instead.
- As far as I can tell, the `words` method is not used anymore, since you commented it out in the building of `NWORDS`.
- `alphabet` is better imported from `string`: `from string import ascii_lowercase as alphabet`.
- I don't understand the definition of `model` in `train`. Why give a score of `1` to missing features, and thus a score of `2` to features encountered once? Moreover, if the aim of `train` is to count how many times a given feature appears in `features`, you'd be better off using a `collections.Counter`.
- You will have a better memory footprint if you turn `edits1` into a generator. Just `yield` (and `yield from` in Python 3) the computed elements instead of storing them in a list.
- Turning `edits1` into a generator will allow you to do the same with `edits2` without requiring it to filter elements itself, and to leave that job to `known` alone, avoiding the discrepancy between how you build your words.
- In `edits1`, you can iterate more easily over `word` and still get the index using `enumerate`. It can simplify some of your checks.
import collections
from string import ascii_lowercase as alphabet

from nltk.corpus import floresta


NWORDS = collections.Counter(floresta.words())


def edits1(word):
    for i, letter in enumerate(word):
        begin, end = word[:i], word[i+1:]
        yield begin + end  # delete
        if end:
            yield begin + end[0] + letter + end[1:]  # transpose
        else:
            for other in alphabet:
                yield begin + letter + other  # insert at the end
        for other in alphabet:
            yield begin + other + end  # replace
            yield begin + other + letter + end  # insert before the current letter


def edits2(word):
    for editted_once in edits1(word):
        for editted_twice in edits1(editted_once):
            yield editted_twice


def known(words):
    return set(w for w in words if w in NWORDS)


def correct(word):
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=NWORDS.get)
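A quick sanity check could look like this (the sample words are arbitrary, and the actual corrections depend on what the floresta corpus contains):

if __name__ == '__main__':
    # Arbitrary sample words; the output depends on the floresta corpus.
    for word in ['casa', 'cassa', 'trabalhu']:
        print('%s -> %s' % (word, correct(word)))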
Main file
- Your top-level code should be under an `if __name__ == '__main__':` clause, so move your `txtCorpus` building and your call to `main` there.
- In fact, `main` would be of better interest if it built `txtCorpus` itself before calling `status_processing`.
- `status_processing` also does more than it advertises, as it processes the statuses but also saves the result in the DB. You should let the caller do whatever they want with the processed results.
- All these `print`s can be unneeded to someone else. Consider using the `logging` module instead (see the short sketch after the code below).
import pandas as pd
import pymongo

import preprocessing

# `db` is an existing MongoDB database handle (setup not shown).


def status_processing(corpus):
    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = str(corpus)

    print "Doing the Initial Process..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"

    print "Starting Lexical Diversity..."
    myCorpus.lexical_diversity()
    print "Done"
    print "----------------------------"

    print "Removing Stopwords..."
    myCorpus.stopwords()
    print "Done"
    print "----------------------------"

    print "Lemmatization..."
    myCorpus.lemmatization()
    print "Done"
    print "----------------------------"

    print "Correcting the words..."
    myCorpus.spell_correct()
    print "Done"
    print "----------------------------"

    print "Untokenizing..."
    word_final = myCorpus.untokenizing()
    print "Done"
    print "----------------------------"

    return word_final


if __name__ == '__main__':
    dtype_dic = {'status_id': str, 'status_message': str, 'status_published': str}
    txt_corpus = list(pd.read_csv(
        'data/MyCSV.csv', dtype=dtype_dic,
        encoding='utf-8', sep=',',
        header='infer', engine='c', chunksize=2))

    word_final = status_processing(txt_corpus)

    print "Saving in DB...."
    try:
        db.myDB.insert(word_final, continue_on_error=True)
    except pymongo.errors.DuplicateKeyError:
        pass
    print "Insertion in the DB Completed. End of the Pre-Processing Process"
preprocessing.py
- `Techniques` should be an `enum`. You can use `flufl.enum` if you need them in Python 2. But since I don't see them used anywhere in the code, you can get rid of that class.
- Since it seems that the code is for Python 2, you should have `PreProcessing` inherit from `object`.
- The `text` property of `PreProcessing` does not add value over a `self.text` attribute initialized in the constructor, especially since you need to set it for the other methods to work.
- `pass` is unnecessary for non-empty blocks.
- `tokenizing` offers a choice between two variants; a boolean parameter would be more suited here. And since you seem to use only one of them, you can give it a default value.
- I would merge `__init__` and `initial_processing`, since this method populates the `self.tokens` attribute with the initial set of tokens every other method works with.
- Using `raise NotImplementedError` instead of `return 'Not implemented yet'` is much more meaningful.
- Consider using list comprehensions or the `list` constructor instead of manually `append`ing items into an empty list.
import nltk
import re

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

import spellcorrect


class PreProcessing(object):
    def __init__(self, text):
        soup = BeautifulSoup(text, "html.parser")
        # TODO: if you want to keep the links, change here
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", soup.get_text())
        self.tokens = self.tokenizing()

    def lexical_diversity(self):
        word_count = len(self.text)
        vocab_size = len(set(self.text))
        return vocab_size / word_count

    def tokenizing(self, use_default_tokenizer=True):
        if use_default_tokenizer:
            return nltk.tokenize.word_tokenize(self.text)
        stok = nltk.data.load('tokenizers/punkt/portuguese.pickle')
        return stok.tokenize(self.text)

    def stopwords(self):
        stopwords = set(nltk.corpus.stopwords.words('portuguese'))
        stopwords.update([
            'foda', 'caralho', 'porra',
            'puta', 'merda', 'cu',
            'foder', 'viado', 'cacete'])
        self.tokens = [word for word in self.tokens if word not in stopwords]

    def stemming(self):
        snowball = SnowballStemmer('portuguese')
        self.tokens = [snowball.stem(word) for word in self.tokens]

    def lemmatization(self):
        lemmatizer = WordNetLemmatizer()  #'portuguese'
        self.tokens = [lemmatizer.lemmatize(word, pos='v') for word in self.tokens]

    def part_of_speech_tagging(self):
        raise NotImplementedError

    def padronizacaoInternetes(self):
        raise NotImplementedError

    def untokenize(self, words):
        """
        Untokenizing a text undoes the tokenizing operation, restoring
        punctuation and spaces to the places that people expect them to be.
        Ideally, `untokenize(tokenize(text))` should be identical to `text`,
        except for line breaks.
        """
        text = ' '.join(words)
        step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
        step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
        step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
        step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
        step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
            "can not", "cannot")
        step6 = step5.replace(" ` ", " '")
        return step6.strip()

    def untokenizing(self):
        return ' '.join(self.tokens)

    def spell_correct(self):
        self.tokens = [spellcorrect.correct(word) for word in self.tokens]
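For illustration, exercising the revised class could look like this (the sample text is made up, and the NLTK data used above must already be downloaded: stopwords, punkt, wordnet and floresta):

if __name__ == '__main__':
    sample = u"<p>Visite www.exemplo.com e veja as fotos hoje!!!</p>"
    corpus = PreProcessing(sample)
    corpus.stopwords()
    corpus.lemmatization()
    corpus.spell_correct()
    print(corpus.untokenizing())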
More generic comments
Consider reading (and following) PEP 8, the official Python style guide, especially as regards:

- `import` declarations;
- whitespace around operators, commas...;
- and variable names.

Also consider using docstrings all around your code; it will make it easier to understand.
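For instance, even a one-line docstring on the small helpers goes a long way (shown here on `untokenizing`):

def untokenizing(self):
    """Rejoin the current tokens into a single space-separated string."""
    return ' '.join(self.tokens)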
- One thing I noticed later: I used pandas instead of the csv module. The reason: I don't know why, but the csv module was not reading my file as UTF-8; many of the lines came out in the form `\u0159`. With pandas I managed to read the file in UTF-8; however, when trying to run the insert, Mongo was not accepting the text. I may need to go back to the csv module and find a way to handle the encoding. – Leandro Santos, Nov 12, 2016 at 15:13
- Trying to implement your notes in `preprocessing.py`: if I merge the `__init__`, I will have no option to pass a parameter in `status_processing`. I am referring to your point: "In fact, `main` would be of better interest if it built `txtCorpus` itself before calling `status_processing`." – Leandro Santos, Nov 12, 2016 at 15:55
- @LeandroSantos I have a hard time figuring out what you really mean. If you need an additional parameter to `status_processing`, you can just add it. If this parameter needs to be passed to what was before `initial_processing`, just add it to `__init__`. The thing is, almost every method in `PreProcessing` relies on `self.tokens` being populated with an iterable, so initializing it to `None` is a bad thing. Thus you might as well do your initialization in `__init__`. – 301_Moved_Permanently, Nov 12, 2016 at 16:40
- Hmm, look: following your logic, when calling `word_final = status_processing(txt_corpus)` the output was `TypeError: status_processing() takes no arguments (1 given)` on that line. Apparently you need to declare the parameter in the method signature; that's what I meant in the previous comment. And if I declare it like `def status_processing(txt_corpus):`, the output is `TypeError: __init__() takes exactly 2 arguments (1 given)`. – Leandro Santos, Nov 12, 2016 at 18:08
- @LeandroSantos `status_processing` is already defined as `def status_processing(corpus):`. And as regards the `__init__` error: since I modified `PreProcessing` afterwards, you'll need to adapt and change `myCorpus = preprocessing.PreProcessing()` into `myCorpus = preprocessing.PreProcessing(corpus)` (obviously removing the next two lines in doing so). – 301_Moved_Permanently, Nov 12, 2016 at 18:32
Comments on the question:

- Is it `status_processing()` that is taking a long time, rather than reading the CSV file? If so, I'm not sure we can help you, considering that you haven't shown us the code behind `myCorpus`.
- The first operations in `status_processing()` run in an acceptable time. The problem is when I invoke `spell_correct()`: after I have read the entire CSV file, it has to correct everything at once.