I have a text file of the following form:
('1', '2')
('3', '4')
.
.
.
and I'm trying to get it to look like this:
1 2
3 4
etc...
I've been trying to do this with Python's re module, chaining re.sub calls like so:
for line in file:
    s = re.sub(r"\(", "", line)
    s1 = re.sub(r",", "", s)
    s2 = re.sub(r"'", "", s1)
    s3 = re.sub(r"\)", "", s2)
    output.write(s3)
output.close()
It seems to work great until I get near the end of my output file; then it becomes inconsistent and stops working. I suspect this is because of the sheer SIZE of the file I am working with: 300MB, or approximately 12 million lines.
Can anyone confirm whether I'm simply running out of memory, or whether it is something else? Are there suitable alternatives, or ways around this?
4 Answers
You could simplify your code by using a simpler regex that finds all numbers in your input:
import re

with open(file_name) as input, open(output_name, 'w') as output:
    for line in input:
        # Collect every run of digits on the line and join them with spaces.
        output.write(' '.join(re.findall(r'\d+', line)))
        output.write('\n')
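To check the per-line behaviour of the findall approach before running it over a large file, the transformation can be pulled into a small function and tried on in-memory text. This is only a minimal sketch: the names `digits_only` and `convert`, and the use of io.StringIO in place of real files, are assumptions made for illustration and are not part of the answer above.

```python
import io
import re

# Compile once; the same pattern is reused for every line.
DIGITS = re.compile(r"\d+")


def digits_only(line):
    """Return the digit runs found in `line`, separated by single spaces."""
    return " ".join(DIGITS.findall(line))


def convert(src, dst):
    """Stream lines from `src` to `dst` one at a time (hypothetical helper)."""
    for line in src:
        dst.write(digits_only(line) + "\n")


# In-memory stand-ins for the real input and output files.
src = io.StringIO("('1', '2')\n('3', '4')\n")
dst = io.StringIO()
convert(src, dst)
result = dst.getvalue()
print(result, end="")
```

Because the loop handles one line at a time, memory use should not grow with the size of the file. Note that this pattern keeps only runs of digits, so signs, decimal points and non-numeric values would be dropped.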
Why not load them as Python tuples with ast.literal_eval? Also, instead of opening and closing the files manually, you can use the with statement, which closes the files at the end of the block:
import ast

with open(file_name) as input, open(output_name, 'w') as output:
    for line in input:
        # Parse "('1', '2')" into the tuple ('1', '2'), then join with a space.
        output.write(' '.join(ast.literal_eval(line.strip())) + '\n')
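As a small illustration of what ast.literal_eval does with one of these lines, the sketch below parses a few sample lines and joins the resulting tuple items with spaces. The helper name `line_to_text` is hypothetical, and skipping blank lines is an added assumption, made because literal_eval raises an error on an empty string.

```python
import ast


def line_to_text(line):
    """Parse a line such as "('1', '2')" and join its items with spaces."""
    items = ast.literal_eval(line.strip())  # -> a tuple of strings
    return " ".join(items)


sample_lines = ["('1', '2')\n", "\n", "('3', '4')\n"]

# Skip blank lines (assumption: they carry no data and can be ignored).
converted = [line_to_text(l) for l in sample_lines if l.strip()]
print(converted)
```

The join works here because every item in the tuple is already a string; if the file held bare numbers instead, each item would need `str()` applied first.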
I would use a namedtuple, which makes the code more readable.
# Python 3
from collections import namedtuple
from ast import literal_eval
#...
Row = namedtuple('Row', 'x y')
with open(in_file, 'r') as f, open(out_file, 'w') as output:
    # Iterate over the file directly rather than calling readlines(),
    # so the whole file is never held in memory at once.
    for line in f:
        output.write("{0.x} {0.y}\n".format(Row._make(literal_eval(line))))
This uses ast.literal_eval, as mentioned by @Kasramvd, which converts each line from a string to a tuple for input to the namedtuple, and also consolidates the output.write() calls.
This is one way to do it without the re module:
in_file = open(r'd:\temp\02\input.txt', 'r')
out_file = open(r'd:\temp\02\output.txt', 'w')
for line in in_file:
    # Drop quotes and parentheses; turn the ", " separator into one space.
    out_file.write(line.replace("'", '').replace('(', '').replace(', ', ' ').replace(')', ''))
in_file.close()
out_file.close()
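A related option that also avoids the re module is str.translate, which deletes a set of characters in one pass instead of chaining replace calls. This is a different technique from the one in the answer above, shown only as a sketch; the names `DELETE_PUNCT` and `strip_punct` are hypothetical.

```python
# Translation table that deletes parentheses, commas and single quotes.
# The third argument of str.maketrans lists the characters to remove.
DELETE_PUNCT = str.maketrans("", "", "(),'")


def strip_punct(line):
    """Remove the tuple punctuation; the space after the comma is kept."""
    return line.translate(DELETE_PUNCT)


cleaned = [strip_punct(l) for l in ["('1', '2')\n", "('3', '4')\n"]]
print(cleaned)
```

This relies on the input having exactly one space after each comma, since that space becomes the separator in the output.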
You could also ast.literal_eval each line and use csv to write it back out.
A single substitution would also work: output.write(re.sub(r"\(\s*'(\d+)',\s*'(\d+)'\s*\)", r"\1 \2", line)). But as I say, that's not your problem. You might need to show more of your code to get an answer to that particular issue.
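The two suggestions in these comments can be sketched roughly as follows: one parses each line with ast.literal_eval and writes it back with a space-delimited csv.writer, the other uses a single re.sub with backreferences. The function names and the io.StringIO stand-ins for real files are assumptions made for illustration, not code from the commenters.

```python
import ast
import csv
import io
import re

# One pattern for the whole line; the two groups capture the two numbers.
PAIR = re.compile(r"\(\s*'(\d+)',\s*'(\d+)'\s*\)")


def convert_with_csv(src, dst):
    """Parse each line into a tuple and write it as a space-delimited row."""
    writer = csv.writer(dst, delimiter=" ", lineterminator="\n")
    for line in src:
        writer.writerow(ast.literal_eval(line.strip()))


def convert_with_backrefs(line):
    """Rewrite "('1', '2')" as "1 2" using the captured groups."""
    return PAIR.sub(r"\1 \2", line)


# In-memory stand-ins for the real input and output files.
src = io.StringIO("('1', '2')\n('3', '4')\n")
dst = io.StringIO()
convert_with_csv(src, dst)
csv_result = dst.getvalue()
regex_result = convert_with_backrefs("('5', '6')\n")
print(csv_result, end="")
print(regex_result, end="")
```

With a real output file, the csv documentation recommends opening it with `newline=''` so the writer controls line endings itself.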