I have a text file that is 300 MB in size. I want to read it and then print the 50 most frequently used words. When I run the program it gives me a MemoryError. My code is as under:
import sys, string
import codecs
import re
from collections import Counter
import collections
import itertools
import csv
import unicodedata
words_1800 = []
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800.extend(re.findall('\w+', sepFile_1800))
for wrd_1800 in [words_1800]:
    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(50))
print(common_words_1800)
It gives me the following error:
Traceback (most recent call last):
  File "C:\Python34\CommonWords.py", line 17, in <module>
    words_1800.extend(re.findall('\w+', sepFile_1800))
MemoryError
4 Answers
You can use a generator instead of a list to hold the results of re.findall, which is much better in terms of memory use. You can also use re.finditer instead of findall, which returns an iterator:
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.findall(r'\w+', line.lower()) for line in File_1800)
Then words_1800 will be an iterator containing lists of the found words, or use
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.finditer(r'\w+', line.lower()) for line in File_1800)
to get an iterator containing iterators of match objects.
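As a minimal sketch (reusing the question's file path and its len(w) > 3 filter), you can feed the matches straight into a Counter:

from collections import Counter
import re

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    # yield one word at a time; no intermediate list of all words is built
    words_1800 = (m.group() for line in File_1800
                  for m in re.finditer(r'\w+', line.lower()))
    # consume the generator here, before the file is closed
    common_words_1800 = Counter(w for w in words_1800 if len(w) > 3)
print(common_words_1800.most_common(50))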
Note (from the comments): you have to consume the words_1800 iterator within the with block, i.e. before the file gets closed.

You can use the Counter up front, saving the memory used by intermediate lists (especially words_1800, which ends up as big as the file you're reading):
common_words_1800 = Counter()
with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        for match in re.finditer(r'\w+', line.lower()):
            word = match.group()
            if len(word) > 3:
                common_words_1800[word] += 1
print(common_words_1800.most_common(50))
If your file contains ASCII you don't need a regex; you can split each line into words and rstrip the punctuation, creating your Counter with a generator expression:
from string import punctuation
from collections import Counter

with open('E:\\Book\\1800.txt') as f:
    cn = Counter(wrd for line in f
                 for wrd in (w.rstrip(punctuation) for w in line.lower().split())
                 if len(wrd) > 3)
print(cn.most_common(50))
If you were using a regex, you should compile it first; you can then use it with a generator expression:
from collections import Counter
import re

with open('E:\\Book\\1800.txt') as f:
    r = re.compile(r"\w+")
    cn = Counter(wrd for line in f
                 for wrd in r.findall(line.lower()) if len(wrd) > 3)
print(cn.most_common(50))
Your code works, but it is rather memory inefficient: with a 300 MB file there are a lot of words to hold in memory at once. Try the suggestions given by @Kasramvd; it is a good idea to use iterators instead of full lists.
In addition, here is a fine blog post about checking memory usage and profiling applications in Python: Python - memory usage.
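For a quick check without extra tooling, the standard library's tracemalloc module (new in Python 3.4, the version shown in the traceback) can report current and peak allocations; a minimal sketch:

import tracemalloc

tracemalloc.start()

# ... run the word-counting code here ...

current, peak = tracemalloc.get_traced_memory()
print("current: {:.1f} MiB, peak: {:.1f} MiB".format(current / 2**20, peak / 2**20))
tracemalloc.stop()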
Comments on the question:
What is for wrd_1800 in [words_1800] supposed to do, exactly?
words_1800.extend(re.findall('\w+', sepFile_1800)) is giving an endless loop.
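As for the first comment: [words_1800] is a one-element list whose only element is the whole word list, so the loop body runs exactly once with wrd_1800 bound to the entire list. A small illustration (hypothetical data):

words = ['a', 'b', 'c']
for w in [words]:
    print(w)  # prints ['a', 'b', 'c'] once: w is the whole list, not each word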