Memory Limit Using Regex on massive text file

Asked 10 years ago

Viewed 173 times

I have a text file of the following form:

('1', '2')
('3', '4')
 .
 .
 .

and I'm trying to get it to look like this:

1 2
3 4
etc...

I've been trying to do this using the re module in Python, by chaining together re.sub calls like so:

for line in file:
 s = re.sub(r"\(", "", line)
 s1 = re.sub(r",", "", s)
 s2 = re.sub(r"'", "", s1)
 s3 = re.sub(r"\)", "", s2)
 output.write(s3)
output.close()

It seems to work great until I get near the end of my output file; then it becomes inconsistent and stops working. I am thinking it is because of the sheer SIZE of the file I am working with; 300MB or approximately 12 million lines.

Can anyone help me confirm that I'm simply running out of memory, or whether it is something else? Are there suitable alternatives or ways around this?

asked Sep 22, 2015 at 15:43

Eli Riekeberg

It looks like your file is full of representations of two-tuples of strings representing integers - why?! You could ast.literal_eval each line and use csv to write it back out.

– jonrsharpe
Commented Sep 22, 2015 at 15:45
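
The suggestion in this comment could be sketched roughly as follows. This assumes every non-blank line is a valid Python literal for a two-element tuple of strings; the function name convert_lines and the space delimiter are illustrative choices, not something taken from the thread.

```python
import ast
import csv
import io


def convert_lines(lines, out):
    """Parse each "('1', '2')"-style line and write it as a space-delimited row."""
    # lineterminator="\n" replaces csv.writer's default "\r\n" row ending.
    writer = csv.writer(out, delimiter=" ", lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        # Unlike eval, literal_eval only accepts Python literals.
        writer.writerow(ast.literal_eval(line))


# Small in-memory demonstration; a real run would pass open file objects.
sample = io.StringIO("('1', '2')\n('3', '4')\n")
result = io.StringIO()
convert_lines(sample, result)
print(result.getvalue(), end="")
```

Because the function only iterates over its input, it can be handed an open file object and will still read one line at a time.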

It's processing the file line by line, so I don't see how the size of the file should be causing a problem. Are you sure there isn't something else in your code creating an issue?

– lurker
Commented Sep 22, 2015 at 15:46

You can use a single regex: output.write(re.sub(r"\(\s*'(\d+)',\s*'(\d+)'\s*\)", r"\1 \2", line)). But as I say, that's not your problem. You might need to show more of your code to get an answer to that particular issue.

– lurker
Commented Sep 22, 2015 at 16:08
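
A minimal sketch of the single-regex idea from this comment, combined with line-by-line streaming. The names PAIR, convert_line, and convert_stream are hypothetical, and the pattern is a slightly more whitespace-tolerant variant of the one in the comment.

```python
import io
import re

# Matches a "('1', '2')" pair and captures the two numbers.
PAIR = re.compile(r"\(\s*'(\d+)'\s*,\s*'(\d+)'\s*\)")


def convert_line(line):
    """Rewrite "('1', '2')" as "1 2"; lines that do not match pass through unchanged."""
    return PAIR.sub(r"\1 \2", line)


def convert_stream(src, dst):
    """Copy src to dst one line at a time, so only one line is held in memory."""
    for line in src:
        dst.write(convert_line(line))


# In-memory demonstration; real code would pass open file objects instead.
src = io.StringIO("('1', '2')\n('3', '4')\n")
dst = io.StringIO()
convert_stream(src, dst)
print(dst.getvalue(), end="")
```

With real files, opening both inside a with statement ensures the output is flushed and closed even if an error occurs partway through.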


4 Answers

You could simplify your code by using a simpler regex that finds all numbers in your input:

import re

with open(file_name) as input, open(output_name, 'w') as output:
    for line in input:
        # Collect every run of digits on the line and join them with spaces.
        output.write(' '.join(re.findall(r'\d+', line)))
        output.write('\n')

answered Sep 22, 2015 at 15:51

Christian Stade-Schuldt

Why not load them as Python tuples with ast.literal_eval? Also, instead of opening and closing the files manually, you can use a with statement, which closes the files at the end of the block:

import ast

with open(file_name) as input, open(output_name, 'w') as output:
    for line in input:
        # literal_eval turns "('1', '2')" into the tuple ('1', '2').
        output.write(' '.join(ast.literal_eval(line.strip())) + '\n')

answered Sep 22, 2015 at 15:47

Kasravnd

I would use a namedtuple, which makes the code more readable.

# Python 3
from collections import namedtuple
from ast import literal_eval
#...
Row = namedtuple('Row', 'x y')
with open(in_file, 'r') as f, open(out_file, 'w') as output:
    # Iterate over the file directly; readlines() would load every line into memory.
    for line in f:
        row = Row._make(literal_eval(line.strip()))
        output.write("{0.x} {0.y}\n".format(row))

edited Sep 23, 2015 at 14:57

answered Sep 22, 2015 at 16:05

siegerts

2 Comments

I got this error (my first line is 35 characters long): r = Row._make(line) File "<string>", line 21, in _make TypeError: Expected 2 arguments, got 35

– Eli Riekeberg
Commented Sep 23, 2015 at 13:48

@EliRiekeberg, okay, updated to fix that. The answer now uses ast.literal_eval, as mentioned by @Kasramvd, to convert each line from a string to a tuple before passing it to the namedtuple, and it also consolidates the output.write() calls.

– siegerts
Commented Sep 23, 2015 at 14:58

This is one way to do it without the re module:

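# Raw strings keep the backslashes in the Windows paths literal.
with open(r'd:\temp\02\input.txt', 'r') as in_file, \
        open(r'd:\temp\02\output.txt', 'w') as out_file:
    for line in in_file:
        # Drop quotes and parentheses; turn the ", " separator into a single space.
        out_file.write(line.replace("'", '').replace('(', '').replace(', ', ' ').replace(')', ''))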
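
As a variation on the chained replace calls (a swapped-in technique, not what this answer uses), str.translate can delete all four punctuation characters in a single pass. The sketch below assumes the pairs are always separated by a comma followed by exactly one space, so that the remaining space serves as the separator; DROP and strip_punctuation are hypothetical names.

```python
# Translation table that deletes parentheses, commas, and single quotes.
DROP = str.maketrans("", "", "(),'")


def strip_punctuation(line):
    """Remove the tuple punctuation; the space after the comma is kept."""
    return line.translate(DROP)


print(strip_punctuation("('1', '2')"))
```

The function can be dropped into the same loop in place of the four replace calls.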

answered Sep 22, 2015 at 18:47

Matej