I have a text file that is 300 MB in size. I want to read it and then print the 50 most frequently used words. When I run the program it gives me a MemoryError. My code is as under:
import sys, string
import codecs
import re
from collections import Counter
import collections
import itertools
import csv
import unicodedata
words_1800 = []
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800.extend(re.findall('\w+', sepFile_1800))
for wrd_1800 in [words_1800]:
    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(50))
print(common_words_1800)
It gives me the following error:
Traceback (most recent call last):
  File "C:\Python34\CommonWords.py", line 17, in <module>
    words_1800.extend(re.findall('\w+', sepFile_1800))
MemoryError
4 Answers
You can use a generator instead of a list to hold the results of re.findall, which is much better in terms of memory use. You can also use re.finditer instead of findall, which returns an iterator:
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.findall(r'\w+', line.lower()) for line in File_1800)
Then words_1800 will be an iterator containing lists of the found words, or use
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.finditer(r'\w+', line.lower()) for line in File_1800)
to get an iterator containing iterators of match objects.
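As a minimal sketch (reusing the question's file path and its len(w) > 3 filter), you can feed the matches straight into a Counter:

from collections import Counter
import re

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    # yield one word at a time; no intermediate list of all words is built
    words_1800 = (m.group() for line in File_1800
                  for m in re.finditer(r'\w+', line.lower()))
    # consume the generator here, before the file is closed
    common_words_1800 = Counter(w for w in words_1800 if len(w) > 3)
print(common_words_1800.most_common(50))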
Note (from the comments): you have to consume the words_1800 iterator within the with block, i.e. before the file gets closed.

You can use the Counter up front, saving the memory used by intermediate lists (especially words_1800, which ends up as big as the file you're reading):
common_words_1800 = Counter()
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        for match in re.finditer(r'\w+', line.lower()):
            word = match.group()
            if len(word) > 3:
                common_words_1800[word] += 1
print(common_words_1800.most_common(50))
If your file contains ASCII you don't need a regex; you can split each line into words and rstrip the punctuation, creating your Counter with a generator expression:
from string import punctuation
from collections import Counter

with open('E:\\Book\\1800.txt') as f:
    cn = Counter(wrd for line in f
                 for wrd in (w.rstrip(punctuation) for w in line.lower().split())
                 if len(wrd) > 3)
print(cn.most_common(50))
If you were using a regex, you should compile it first; you can then use it with a generator expression:
from collections import Counter
import re

with open('E:\\Book\\1800.txt') as f:
    r = re.compile(r"\w+")
    cn = Counter(wrd for line in f
                 for wrd in r.findall(line.lower()) if len(wrd) > 3)
print(cn.most_common(50))
Your code works, but it is rather memory inefficient: with a 300 MB file there are a lot of words to hold in memory at once. Try the suggestions given by @Kasramvd; it is a good idea to use iterators instead of full lists.
In addition, here is a fine blog post about checking memory usage and profiling applications in Python: Python - memory usage.
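For a quick check without extra tooling, the standard library's tracemalloc module (new in Python 3.4, the version shown in the traceback) can report current and peak allocations; a minimal sketch:

import tracemalloc

tracemalloc.start()

# ... run the word-counting code here ...

current, peak = tracemalloc.get_traced_memory()
print("current: {:.1f} MiB, peak: {:.1f} MiB".format(current / 2**20, peak / 2**20))
tracemalloc.stop()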
Comments on the question:
What is for wrd_1800 in [words_1800] supposed to do, exactly?
words_1800.extend(re.findall('\w+', sepFile_1800)) is giving an endless loop.
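As for the first comment: [words_1800] is a one-element list whose only element is the whole word list, so the loop body runs exactly once with wrd_1800 bound to the entire list. A small illustration (hypothetical data):

words = ['a', 'b', 'c']
for w in [words]:
    print(w)  # prints ['a', 'b', 'c'] once: w is the whole list, not each word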