I have a large text file (~2GB) full of data. The data (sample below) gives an x, y, z coordinate, and a corresponding result on each line (there is other stuff but I don't care about it). The single large text file is too large to be useful, so I want to split it into several smaller files. However, I want each file to contain all the points on one y-plane. The first few lines of the file are below:
mcnp version 6 ld=05/08/13 probid = 09/09/15 23:06:39
Detector Test
Number of histories used for normalizing tallies = 2237295223.00
Mesh Tally Number 14
photon mesh tally.
Tally bin boundaries:
X direction: -600.00 -598.00 -596.00 ... 1236.00 1238.00 1240.00 1242.00 1244.00 1258.00 1260.00
Y direction: 0.00 10.00 20.00 ... 740.00 750.00 760.00 770.00 780.00 790.00 800.00 810.00 820.00 830.00 840.00 850.00 860.00
Z direction: -60.00 -58.00 -56.00 ... 592.00 594.00 596.00 598.00 600.00
Energy bin boundaries: 1.00E-03 1.00E+36
Energy X Y Z Result Rel Error Volume Rslt * Vol
1.000E+36 -599.000 5.000 -59.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00
1.000E+36 -599.000 5.000 -57.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00
1.000E+36 -599.000 5.000 -55.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00
... and repeat forever...
I've truncated some of it for readability, but you get the idea. The data I want is the four last lines.
The code currently does the following:
- Find the line of data headers (`Energy X Y ...`)
- Find the y value of the first line of data
- Add the data to a list until we find data with a different y value
- Dump the list to a file named with the y value, delete the list
- Repeat steps 3 and 4 until the end of the file.
Not all the data for each y plane is together, so if I encounter data at a y-value I've seen before, the data is appended to an existing file.
My code is below. It works, but I feel like I could improve efficiency somewhere (execution took ~30 min). As always, readability/style improvements are welcome, but performance is the primary goal.
import os
with open("meshtal", 'r') as f:
    i = 0
    coords = []
    curY = 0
    for l in f:
        #If data header already found
        if i:
            line = l.split()
            #If this is the first line of data
            if i == 1:
                curY = line[2]
                coords.extend([(line[1],line[2],line[3],line[4])])
                i += 1
            else:
                #If data has the same y value as previous
                if curY == line[2]:
                    coords.extend([(line[1],line[2],line[3],line[4])])
                    i += 1
                #New y value, dump existing data to file
                else:
                    fname = "Y={}.txt".format(curY)
                    #if y value has already been encountered, append existing file
                    if os.path.exists(fname):
                        with open("Y={}.txt".format(curY), 'a') as out:
                            for coord in coords:
                                out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
                    #New y value, create new file
                    else:
                        with open("Y={}.txt".format(curY), 'w') as out:
                            out.write("X Y Z Result \n")
                            for coord in coords:
                                out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
                    i = 1
                    coords = []
                    curY = line[2]
                    coords.extend([(line[1],line[2],line[3],line[4])])
                    i += 1
        #If no data header has been found
        else:
            #If current line is data header, raise flag
            if l.lstrip().startswith("Energy X Y Z Result Rel Error Volume Rslt * Vol"):
                i += 1
                print "found start"
2 Answers
You have a lot of nesting going on here. That's generally harder to read and parse, especially when you could make liberal use of `continue` instead. `continue` will skip to the next iteration of the loop, ignoring all remaining code in the loop body. So you could move your check for the header line to the top and avoid indentation:
for l in f:
    #If data header not found
    if not i:
        if l.lstrip().startswith("Energy X Y Z Result Rel Error Volume Rslt * Vol"):
            i += 1
            print "found start"
        continue
Also, `i` is a terrible variable name here. `i` is initially being used to indicate that a line has been found, then seems to become an index value. Instead I would initialise `i` as your index once this line is found, but use a named boolean like `found_header` instead. Something that clear could remove the need for comments, since `if found_header` is self-explanatory. Likewise, I think you should use `line` instead of `l`. You do use `line` to replace `l` later. `l` in particular can look like a one or an upper case letter i, so it's not clear.
Also there's nothing wrong with doing `line = line.split()` since you don't need the original value of `line` after this part.
I'd move `i += 1` out of the `if else`, since it happens in both cases anyway. You can do it at the start of the loop if you just initialise `i` as `0`. Once again, I'd use `continue` to save a level of nesting, like so:
#If this is the first line of data
i += 1
if i == 1:
    curY = line[2]
    coords.extend([(line[1],line[2],line[3],line[4])])
    continue
#If data has the same y value as previous
if curY == line[2]:
    coords.extend([(line[1],line[2],line[3],line[4])])
    continue
#New y value, dump existing data to file
fname = "Y={}.txt".format(curY)
Also, append mode will still create a new empty file if none exists, so you don't need separate branches for the two cases. Just always open with `'a'` and then write your data. To decide whether the column header is needed, check whether the file exists before opening it and store the result as a boolean.
#Check before opening, since append mode creates a missing file
new_file = not os.path.exists(fname)
with open(fname, 'a') as out:
    if new_file:
        out.write("X Y Z Result \n")
    for coord in coords:
        out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
So here's how I'd put together the whole thing:
import os
header = "Energy X Y Z Result Rel Error Volume Rslt * Vol"
with open("meshtal", 'r') as f:
    header_found = False
    i = 0
    coords = []
    curY = 0
    for line in f:
        if not header_found:
            if line.lstrip().startswith(header):
                print "found start"
                header_found = True
            continue
        line = line.split()
        i += 1
        #If this is the first line of data
        if i == 1:
            curY = line[2]
            coords.extend([(line[1],line[2],line[3],line[4])])
            continue
        #If data has the same y value as previous
        if curY == line[2]:
            coords.extend([(line[1],line[2],line[3],line[4])])
            continue
        #New y value, dump existing data to file
        filename = "Y={}.txt".format(curY)
        #Check before opening, since append mode creates a missing file
        new_file = not os.path.exists(filename)
        with open(filename, 'a') as out:
            if new_file:
                out.write("X Y Z Result \n")
            for coord in coords:
                out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
        i = 1
        coords = []
        curY = line[2]
        coords.extend([(line[1],line[2],line[3],line[4])])
- wnnmaw (Sep 10, 2015 at 17:45): It seems like, since the data is particularly un-ordered, that I end up switching between y-values a lot. Is it worth leaving each output file open? (open it on the first instance of a y-value, then close them all at the end)
- SuperBiasedMan (Sep 10, 2015 at 17:47): Will you have a lot of them to open and write a lot of data? I'd have to check how large those objects would be if you were to do that. I'm actually not sure.
- wnnmaw (Sep 10, 2015 at 17:50): There are 90 files, and I'm opening and closing each one several times a minute, and the file size goes up between 40 - 70 KB each time.
- SuperBiasedMan (Sep 11, 2015 at 9:27): @wnnmaw From what I can see that would save a lot of time, yes. I think it'd be best to store as a dictionary of `curY: open(file)`. However you can't use `with` in that case, so you should make sure to handle file closing properly. Do you want me to add a section to my answer about this?
- wnnmaw (Sep 11, 2015 at 14:25): Storing the files in a dict is a good idea, I can implement that. Thanks
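The idea from the comments above (one open handle per y-value kept in a dict, all closed at the end) might look roughly like the following sketch. The function name `split_by_y` and its `path` and `out_dir` parameters are assumptions made for illustration, not part of the original code. It also assumes the file contains a single data section after the column-header line, and it recognises that line by its first four column names rather than by the full header string.

```python
import os


def split_by_y(path, out_dir="."):
    """Hypothetical helper: write one output file per y-value found in `path`."""
    outputs = {}  # maps y-value string -> open file handle
    header_found = False
    try:
        with open(path, "r") as f:
            for line in f:
                fields = line.split()
                if not header_found:
                    # Assumption: the header is the line whose first four
                    # columns are Energy, X, Y, Z.
                    header_found = fields[:4] == ["Energy", "X", "Y", "Z"]
                    continue
                if len(fields) < 5:
                    continue  # skip blank or malformed lines
                y = fields[2]
                out = outputs.get(y)
                if out is None:
                    # First time this y-value is seen: create its file once.
                    name = os.path.join(out_dir, "Y={}.txt".format(y))
                    out = open(name, "w")
                    out.write("X Y Z Result \n")
                    outputs[y] = out
                out.write("{:10}{:10}{:10}{:10}\n".format(*fields[1:5]))
    finally:
        # No `with` block manages these handles, so close them explicitly,
        # even if an error interrupts the loop.
        for out in outputs.values():
            out.close()
    return sorted(outputs)
```

Because every line goes straight to its own file, there is no `coords` list to flush and no special case for the last batch. With roughly 90 distinct y-values, as mentioned in the comments, the number of simultaneously open files should stay below typical per-process limits, though that is worth checking on the target system.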
Correct me if I am wrong, but it seems that the last batch of coords is never written out.
Not using `for l in f`, but explicitly calling `f.readline()`, significantly simplifies the flow:
    with open("meshtal", 'r') as f:
        while not f.readline().lstrip().startswith("..."):
            pass
        # Header found. The rest are data points
        ...
Before the first data point is read, `curY` is undefined. It seems logical to initialize it to `None`, and abandon the first/not-first line detection (recall that `None` compares unequal to any actual y value):
    line = f.readline().split()
    if curY == line[2]:
        # Y is the same, keep going
    else:
        # dump if necessary; reinitialize coords and curY
Since `coords.extend()` is needed regardless of `curY` changes, it is better to take it out of the conditional.
Putting it all together:
with open("meshtal", 'r') as f:
    # Skip everything up to and including the header line
    while not f.readline().lstrip().startswith("..."):
        pass
    curY = None
    coords = []
    while True:
        line = f.readline()
        if not line:
            dump(curY, coords) # takes care of the last batch
            break
        line = line.split()
        if curY != line[2]:
            dump(curY, coords)
            curY = line[2]
            coords = []
        coords.extend([(line[1],line[2],line[3],line[4])])
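The combined code calls a `dump` helper that the answer leaves undefined. A minimal sketch of what it might do is shown below, assuming it should append to `Y=<value>.txt` in the same format as the question's code and write the column header only when the file is first created. The `out_dir` parameter is an addition for illustration, and the early return covers the first call, where `curY` is still `None`.

```python
import os


def dump(cur_y, coords, out_dir="."):
    """Append a batch of (x, y, z, result) tuples to the file for cur_y."""
    if cur_y is None or not coords:
        return  # nothing collected yet, e.g. before the first data line
    fname = os.path.join(out_dir, "Y={}.txt".format(cur_y))
    # Check before opening: append mode creates the file if it is missing.
    is_new = not os.path.exists(fname)
    with open(fname, "a") as out:
        if is_new:
            out.write("X Y Z Result \n")
        for coord in coords:
            out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
```

Called with the default `out_dir`, this writes next to the script, matching the file names used elsewhere on this page.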