Splitting large text file and sorting by content

Question 1

I have a large text file (~2GB) full of data. The data (sample below) gives an x, y, z coordinate, and a corresponding result on each line (there is other stuff but I don't care about it). The single large text file is too large to be useful, so I want to split it into several smaller files. However, I want each file to contain all the points on one y-plane. The first few lines of the file are below:

 mcnp version 6 ld=05/08/13 probid = 09/09/15 23:06:39 
 Detector Test 
 Number of histories used for normalizing tallies = 2237295223.00 
 Mesh Tally Number 14 
 photon mesh tally. 
 Tally bin boundaries:
 X direction: -600.00 -598.00 -596.00 ... 1236.00 1238.00 1240.00 1242.00 1244.00 1258.00 1260.00
 Y direction: 0.00 10.00 20.00 ... 740.00 750.00 760.00 770.00 780.00 790.00 800.00 810.00 820.00 830.00 840.00 850.00 860.00
 Z direction: -60.00 -58.00 -56.00 ... 592.00 594.00 596.00 598.00 600.00 
 Energy bin boundaries: 1.00E-03 1.00E+36 
 Energy X Y Z Result Rel Error Volume Rslt * Vol 
 1.000E+36 -599.000 5.000 -59.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00 
 1.000E+36 -599.000 5.000 -57.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00 
 1.000E+36 -599.000 5.000 -55.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00
... and repeat forever...

I've truncated some of it for readability, but you get the idea. The data I want is the four last lines.

The code currently does the following:

1. Find the line data headers (Energy X Y ...)
2. Find the y value of the first line of data
3. Add the data to a list until we find data with a different y value
4. Dump the list to a file named with the y value, delete the list
5. Repeat steps 3 and 4 until the end of the file.

Not all the data for each y plane is together, so if I encounter data at a y-value I've seen before, the data is appended to an existing file.

My code is below. It works, but I feel like I could improve efficiency somewhere (execution took ~30 min). As always, readability/style improvements are welcome, but performance is the primary goal.

import os
with open("meshtal", 'r') as f:
    i = 0
    coords = []
    curY = 0
    for l in f:
        #If data header already found
        if i:
            line = l.split()
            #If this is the first line of data
            if i == 1:
                curY = line[2]
                coords.extend([(line[1],line[2],line[3],line[4])])
                i += 1
            else:
                #If data has the same y value as previous
                if curY == line[2]:
                    coords.extend([(line[1],line[2],line[3],line[4])])
                    i += 1
                #New y value, dump existing data to file
                else:
                    fname = "Y={}.txt".format(curY)
                    #if y value has already been encountered, append existing file
                    if os.path.exists(fname):
                        with open("Y={}.txt".format(curY), 'a') as out:
                            for coord in coords:
                                out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
                    #New y value, create new file
                    else:
                        with open("Y={}.txt".format(curY), 'w') as out:
                            out.write("X Y Z Result \n")
                            for coord in coords:
                                out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
                    i = 1
                    coords = []
                    curY = line[2]
                    coords.extend([(line[1],line[2],line[3],line[4])])
                    i += 1
        #If no data header has been found
        else:
            #If current line is data header, raise flag
            if l.lstrip().startswith("Energy X Y Z Result Rel Error Volume Rslt * Vol"):
                i += 1
                print "found start"

Answer 1

You have a lot of nesting going on here. That's generally harder to read and parse, especially when you could make liberal use of continue instead. continue skips to the next iteration of the loop, ignoring all remaining code in the loop body. So you could move your check for the header line to the top and avoid a level of indentation:

for l in f:
    #If data header not found
    if not i:
        if l.lstrip().startswith("Energy X Y Z Result Rel Error Volume Rslt * Vol"):
            i += 1
            print "found start"
        continue
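To try the guard-clause idea in isolation, here is a minimal, self-contained sketch written for Python 3. The function name data_lines, the shortened header string, and the decision to skip blank lines are my own assumptions for illustration, not part of the original code.

```python
def data_lines(lines, header):
    """Yield the whitespace-split fields of every line after the header line."""
    header_found = False
    for line in lines:
        if not header_found:
            # Guard clause: until the header appears, nothing below runs.
            if line.lstrip().startswith(header):
                header_found = True
            continue
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines carry no data and can be skipped
        yield fields


# Hypothetical, shortened sample (the real header line has more columns).
sample = [
    " Mesh Tally Number 14\n",
    " Energy X Y Z Result\n",
    " 1.000E+36 -599.000 5.000 -59.000 0.00000E+00\n",
    "\n",
    " 1.000E+36 -599.000 5.000 -57.000 0.00000E+00\n",
]
rows = list(data_lines(sample, "Energy X Y Z Result"))
print(len(rows))  # 2
```

Because the header check ends in continue, the data-handling code sits at the top level of the loop body rather than inside an else branch.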

Also, i is a poor variable name here. It is initially used as a flag indicating that the header line has been found, and then seems to become an index value. Instead, I would initialise i as your index once this line is found, but use a named boolean like header_found for the flag. A clear name removes the need for comments, since if header_found is self-explanatory. Likewise, I think you should use line instead of l (you already use line to replace l later). l in particular can look like the digit one or an upper-case letter i, so it's hard to read.

Also there's nothing wrong with doing line = line.split() since you don't need the original value of line after this part.

I'd move i += 1 out of the if/else, since it happens in both branches. You can do it at the start of the loop if you initialise i as 0. Once again, I'd use continue to save a level of nesting, like so:

#If this is the first line of data
i += 1
if i == 1:
    curY = line[2]
    coords.extend([(line[1],line[2],line[3],line[4])])
    continue
#If data has the same y value as previous
if curY == line[2]:
    coords.extend([(line[1],line[2],line[3],line[4])])
    continue
#New y value, dump existing data to file
fname = "Y={}.txt".format(curY)

Also, append mode creates the file if it does not exist, so you don't need separate 'w' and 'a' branches. Just always open with 'a' and then write your data. The only thing you need to know is whether the file is new, so that the column header is written exactly once; check that before opening and store the result as a boolean.

#Write the column header only if the file does not exist yet
new_file = not os.path.exists(fname)
with open(fname, 'a') as out:
    if new_file:
        out.write("X Y Z Result \n")
    for coord in coords:
        out.write("{:10}{:10}{:10}{:10}\n".format(*coord))

So here's how I'd put together the whole thing:

import os
header = "Energy X Y Z Result Rel Error Volume Rslt * Vol"
with open("meshtal", 'r') as f:
    header_found = False
    i = 0
    coords = []
    curY = 0
    for line in f:
        if not header_found:
            if line.lstrip().startswith(header):
                print "found start"
                header_found = True
            continue
        line = line.split()
        i += 1
        #If this is the first line of data
        if i == 1:
            curY = line[2]
            coords.extend([(line[1],line[2],line[3],line[4])])
            continue
        #If data has the same y value as previous
        if curY == line[2]:
            coords.extend([(line[1],line[2],line[3],line[4])])
            continue
        #New y value, dump existing data to file
        fname = "Y={}.txt".format(curY)
        #Write the column header only if the file does not exist yet
        new_file = not os.path.exists(fname)
        with open(fname, 'a') as out:
            if new_file:
                out.write("X Y Z Result \n")
            for coord in coords:
                out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
        i = 1
        coords = []
        curY = line[2]
        coords.extend([(line[1],line[2],line[3],line[4])])

Comment 1

It seems like, since the data is particularly unordered, I end up switching between y-values a lot. Is it worth leaving each output file open? (Open it on the first instance of a y-value, then close them all at the end.)

Comment 2

Will you have a lot of them to open and write a lot of data? I'd have to check how large those objects would be if you were to do that. I'm actually not sure.

Comment 3

There are 90 files, and I'm opening and closing each one several times a minute, and the file size goes up by 40-70 KB each time.

Comment 4

@wnnmaw From what I can see that would save a lot of time, yes. I think it'd be best to store them as a dictionary of curY: open(file). However, you can't use with in that case, so you should make sure to handle file closing properly. Do you want me to add a section to my answer about this?

Comment 5

Storing the files in a dict is a good idea, I can implement that. Thanks
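For reference, the dictionary-of-open-files idea from this exchange could look roughly like the sketch below (Python 3). The function name split_by_y, the out_dir parameter, and the assumption that rows arrive as (x, y, z, result) string tuples are all hypothetical. Note one behavioural difference from the original script: files are opened with 'w', so output left over from an earlier run is overwritten rather than appended to.

```python
import os


def split_by_y(rows, out_dir):
    """Write each (x, y, z, result) row to out_dir/Y=<y>.txt.

    Every output file is opened once, on the first occurrence of its
    y value, and stays open until all rows have been processed.
    """
    outputs = {}  # y value (string) -> open file object
    try:
        for x, y, z, result in rows:
            out = outputs.get(y)
            if out is None:
                # First time this y value is seen: create the file
                # and write the column header.
                out = open(os.path.join(out_dir, "Y={}.txt".format(y)), "w")
                out.write("X Y Z Result \n")
                outputs[y] = out
            out.write("{:10}{:10}{:10}{:10}\n".format(x, y, z, result))
    finally:
        # A single with block cannot cover a dict of files,
        # so close them explicitly, even if an error occurred.
        for out in outputs.values():
            out.close()
    return sorted(outputs)
```

With around 90 output files this should stay below typical per-process open-file limits, though that depends on the system. contextlib.ExitStack from the standard library is another way to guarantee that every file is closed.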

Answer 2

Correct me if I am wrong, but it seems that the last batch of coords is never written out.

Calling f.readline() explicitly, instead of iterating with for l in f, simplifies the flow significantly:

with open("meshtal", 'r') as f:
    while not f.readline().lstrip().startswith("..."):
        pass
    # Header found. The rest are data points
    ...

Before the first data point is read, curY is undefined. It seems logical to initialize it to None and abandon the first/not-first line detection (recall that None compares unequal to any string):
```
line = f.readline().split()
if curY == line[2]:
    pass  # Y is the same, keep going
else:
    pass  # dump if necessary; reinitialize coords and curY
```
Since coords.extend() is needed regardless of whether curY changes, it is better to take it out of the conditional.

Putting it all together:

with open("meshtal", 'r') as f:
    while not f.readline().lstrip().startswith("..."):
        pass
    curY = None
    coords = []
    while True:
        line = f.readline()
        if not line:
            dump(curY, coords) # takes care of the last batch
            break
        line = line.split()
        if curY != line[2]:
            dump(curY, coords)
            curY = line[2]
            coords = []
        coords.extend([(line[1],line[2],line[3],line[4])])
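The dump helper is left undefined above. One possible shape for it, as a sketch only (Python 3), reuses the file naming and column format from the question. The optional out_dir parameter and the early return for an empty batch are my own assumptions; the early return is what makes the very first call, with curY still None, harmless.

```python
import os


def dump(curY, coords, out_dir="."):
    """Append coords to the file for curY; do nothing if there is nothing to write."""
    if curY is None or not coords:
        return  # covers the first call, made before any data has been read
    fname = os.path.join(out_dir, "Y={}.txt".format(curY))
    # Check before opening: append mode creates the file if it is missing.
    new_file = not os.path.exists(fname)
    with open(fname, "a") as out:
        if new_file:
            out.write("X Y Z Result \n")
        for coord in coords:
            out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
```

Called as dump(curY, coords), it writes into the current directory, which matches the calls in the loop above.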

score 2 · Accepted Answer · 2015-09-10 16:58:59Z

