I have written a code which works very good for small data-files (see below). The data file 'WSPL.dat' which I want to evaluate has a size of 2.6 GB and the data file 'VELOC.dat' has a size of 4.3 GB. The production of the data 'inputv.dat' takes almost 45 minutes and it has a size of 2.6 GB. At the final step, the code should produce a 'output_veloc_max.dat'. After 30 hours of calculations the last output could not be produced. Nothing was written in the final data. I can imagine that the opening of such a big data and writing a new data which grows continusly can lead to this problem.
My questions:
Can I optimize this code in such a way that it works faster?
If the code cannot be further optimized, should I rewrite the code in multiprocessing code?
If rewriting for multi processing helps, how should I do it?
from __future__ import print_function
import time
import re
c=[]
def get_num(x):
return int(''.join(ele for ele in x if ele.isdigit()))
with open('WSPL.dat', 'r') as f:
for line in f:
if "ND" in line:
print(line)
c=get_num(line)
print(c)
break
linenum = 0
print("Anzahl der Knoten: ",c)
print('Program started.\n')
maxima = [[float('-inf'), ''] for _ in range(c)]
with open('WSPL.dat') as f:
for line in f:
if line.startswith('TS'):
for maximum, line in zip(maxima, f):
linenum += 1
value = float(line)
if value > maximum[0]:
maximum[:] = value, linenum,line
with open('VELOC.dat', 'r') as f, open('inputv.dat', 'w') as outfile:
for line in f:
try:
line = line.strip()
columns = line.split()
vx = float(columns[0])
vy = float(columns[1])
print("{:.2f}\t{:.2f}".format(vx, vy), file=outfile)
except ValueError:
pass
linenum=[x[1] for x in maxima]
i=1
d = {}
with open('inputv.dat', 'r') as f5, open('output_veloc_max.dat', 'w') as out:
out.write('VECTOR\nND {0:2d}\nST 0\nTS 0.00\n'.format(c))
for num, line1 in enumerate(f5, 1):
if num in linenum:
d[num] = line1
out.writelines([d[i] for i in linenum])
print('Programm ist beendet.')
name3 = input("Eingabedatei drucken ")
A very small example for 'WSPL.dat' can be like this:
SCALAR
ND 3
ST 0
TS 10.00
666.0000
0.0000
0.0000
SCALAR
ND 3
ST 0
TS 3600.47
355.1744
255.0201
255.2748
SCALAR
ND 3
ST 0
TS 7200.42
555.5984
255.4946
355.7014
A very small example for 'VELOC.dat' can be like this:
VECTOR
ND 3
ST 0
TS 10.00
0.000 0.000
0.000 0.000
0.000 0.000
VECTOR
ND 3
ST 0
TS 3600.47
-0.280 1.385
-0.494 1.239
-0.574 1.647
VECTOR
ND 3
ST 0
TS 7200.42
-0.609 2.114
-0.900 1.734
-0.871 2.293
The lines under the TS in VLOC.dat and WSPL.dat belong to a node. for this small example we got 3 nodes (ND=3). The program finds the maximum value for each node in WSPL.dat and writes the line numbers in inputv.dat. Then opens the inputv.dat and VLOC.dat and finds the values for the given line numbers and write in another file 'output_veloc_max.dat'.
-
\$\begingroup\$ Welcome to codereview! Could you please give us some input/output examples ? More, have in mind that your last question is barely on-topic for this sites' purposes. Here, we are supposed to only review the existing code while what you're asking is basically "write this for me in a different way" \$\endgroup\$Grajdeanu Alex– Grajdeanu Alex2016年11月01日 08:47:29 +00:00Commented Nov 1, 2016 at 8:47
-
\$\begingroup\$ Thank you for welcoming me. I have added a very small example. I would appreciate and learn a lot if some experienced programer can write this in a different way. If you think it is not proper, so please give me just a hint if rewriting the code with multiprocessing (threading) help to save time? Which part should I put in different threads? \$\endgroup\$Mohamad Reza Salehi Sadaghiani– Mohamad Reza Salehi Sadaghiani2016年11月01日 09:17:34 +00:00Commented Nov 1, 2016 at 9:17
2 Answers 2
Performance
Your problem report baffles me. You say that you can successfully produce inputv.dat
, a 2.6 GB file, in 45 minutes. Yet, filtering inputv.dat
to produce output_veloc_max.dat
fails after 30 hours? The only possible bottleneck is if num in linenum
— and that would only be inefficient if the number of nodes is huge. To make it work efficiently for a large number of nodes, you could use a set
instead of a list
:
linenum = set(x[1] for x in maxima)
Multiprocessing would not help you much, since the task is not very parallelizable. It is possible to analyze WSPL.dat
in parallel with the VELOC.dat
→inputv.dat
conversion, but producing output_veloc_max.dat
must wait. Since these data files are line-oriented, you cannot easily skip to arbitrary points within them, so you must process VELOC.dat
sequentially. The only significant optimization would be to avoid writing out inputv.dat
as an intermediate result and reading it back in.
Style
Drop imports that you not using, like import time
and import re
. Since you have tagged this question as python-3.x, you also don't need from __future__ import print_function
.
The initialization of c
as c=[]
is garbage. The real assignment is c=get_num(line)
. The get_num
function is weird, in that it gathers up all digits that appear on a line, returning 24
even if the text is "2 cool 4 me"
.
I'm not a fan of your maxima
two-dimensional array. By storing pairs in a list, you end up with code that is obfuscated by subscripts, like if value > maximum[0]
and [x[1] for x in maxima]
. Ideally, lists should be used for homogeneous data of variable length. This pair would be better written as a tuple or a namedtuple
. But since you don't care about the values as soon as you have obtained the line numbers, you might as well make two separate variables max_samples
and max_indices
.
The way you parse VELOC.dat
is odd. You indiscriminately attempt to interpret the first column of every line as a float
, then ignore the line if that fails. I would guess, looking at your excerpt of VELOC.dat
, that a better approach would be to look for a line containing "TS"
, then take the following c lines, where c is the number of nodes — the same strategy that you use when parsing WSPL.dat
.
This program is large enough that it should be split into functions, so that each chunk of code has a name that reflects its purpose, and so that each function has its own local variables.
from itertools import count
def n_lines_after_ts(n, f):
"""
Filter the file, extracting the next n lines after each occurrence of "TS".
Each result is a triple consisting of:
* A sequential sample number (counting up from 0)
* The node number for that sample (cycling 0, 1, ..., n-1)
* The line from the file
"""
line_counter = count()
for line in f:
if line.startswith('TS'):
for node in range(n):
yield next(line_counter), node, next(f)
def num_nodes(wspl_dat):
"""
Parse the file to extract the first number that appears after "ND".
"""
for line in wspl_dat:
if line.startswith('ND'):
return int(line.partition('ND')[-1])
def wspl_max_indices(num_nodes, wspl_dat):
"""
Find the indices of the maximum sample for each node.
"""
max_samples = [float('-inf')] * num_nodes
max_indices = [None] * num_nodes
for line_num, node, sample in n_lines_after_ts(num_nodes, wspl_dat):
sample = float(sample)
if sample > max_samples[node]:
max_samples[node] = sample
max_indices[node] = line_num
return max_indices
def extract_veloc(num_nodes, veloc_dat, indices):
"""
Extract the specified indices from the velocity data.
"""
set_of_indices = set(indices)
lines = {
line_num: sample
for line_num, _, sample in n_lines_after_ts(num_nodes, veloc_dat)
if line_num in set_of_indices
}
return (lines[i] for i in indices)
with open('WSPL.dat') as wspl_dat_file, \
open('VELOC.dat') as veloc_dat_file, \
open('output_veloc_max.dat', 'w') as out:
num_nodes = num_nodes(wspl_dat_file)
print('Anzahl der Knoten: {0}'.format(num_nodes))
max_wspl_indices = wspl_max_indices(num_nodes, wspl_dat_file)
out.write('VECTOR\nND {0:2d}\nST 0\nTS 0.00\n'.format(num_nodes))
out.writelines(extract_veloc(num_nodes, veloc_dat_file, max_wspl_indices))
print('Programm ist beendet.')
-
\$\begingroup\$ WOW! I really enjoyed your analyse! I have learned a lot. I have understood but I need more time to internalise it. \$\endgroup\$Mohamad Reza Salehi Sadaghiani– Mohamad Reza Salehi Sadaghiani2016年11月02日 09:04:37 +00:00Commented Nov 2, 2016 at 9:04
-
\$\begingroup\$ I have a small question. I have run the code which you have written for bigger data. It works perfectly, but at the end the velocieties will be sorted and written in the final output. In function "extract_veloc", the indices will be sorted and extracted from VELOC.dat and written in output. But what I need is the velocities respected to the indicies (without sorting). I tryed to take the following line out "indices = set(indices)", but it did not wrok. \$\endgroup\$Mohamad Reza Salehi Sadaghiani– Mohamad Reza Salehi Sadaghiani2016年11月21日 09:33:20 +00:00Commented Nov 21, 2016 at 9:33
-
\$\begingroup\$ My solution doesn't sort anything.
extract_veloc()
is just a filter. The output will appear in the same order of appearance as the input.indices = set(indices)
is just an optimization to use the right data structure so that thein
operator is fast. \$\endgroup\$200_success– 200_success2016年11月21日 09:44:03 +00:00Commented Nov 21, 2016 at 9:44 -
\$\begingroup\$ Thanks for your reply. For example if the indices of the maxima is [15, 16, 17, 18, 5, 8, 9] the maximum velocities will be appear in this order [5, 8, 9, 15, 16, 17, 18]. I think it is because of the if statement, isn't it? If searches in the list and finds the first integer. Am I correct? \$\endgroup\$Mohamad Reza Salehi Sadaghiani– Mohamad Reza Salehi Sadaghiani2016年11月21日 09:49:27 +00:00Commented Nov 21, 2016 at 9:49
-
\$\begingroup\$ I see. I think I had misunderstood the problem. Corrected in Rev 5. \$\endgroup\$200_success– 200_success2016年11月21日 19:25:08 +00:00Commented Nov 21, 2016 at 19:25
First, some PEP8 style issues and general coding style advices:
- You shouldn't import modules that you're not using (
import time
andimport re
) You should have two newlines before declaring a new function:
This:
c=[] def get_num(x): return int(''.join(ele for ele in x if ele.isdigit()))
Would become:
c=[] def get_num(x): return int(''.join(ele for ele in x if ele.isdigit()))
You should indent your code by 4-spaces in a new block, not 8 as you did in your first
if
block.Be consistent when writing code !
Sometimes you use
with open()
to write to a file, and sometimes you do this:print("{:.2f}\t{:.2f}".format(vx, vy), file=outfile)
. Choose one of them. I'd choose the first one.More, you can directly use
with open(filename)
when reading from a file. This will automatically open the file in read mode.Around
=
operator, you should have a space. Also, after,
you should have a space.Having said this, your code will look like this so far:
from __future__ import print_function def get_num(x): return int(''.join(ele for ele in x if ele.isdigit())) c = [] with open('WSPL.dat') as f: for line in f: if "ND" in line: print(line) c = get_num(line) print(c) break linenum = 0 print("Anzahl der Knoten: ", c) print('Program started.\n') maxima = [[float('-inf'), ''] for _ in range(c)] with open('WSPL.dat') as f: for line in f: if line.startswith('TS'): for maximum, line in zip(maxima, f): linenum += 1 value = float(line) if value > maximum[0]: maximum[:] = value, linenum, line with open('VELOC.dat') as f, open('inputv.dat', 'w') as outfile: for line in f: try: line = line.strip() columns = line.split() vx = float(columns[0]) vy = float(columns[1]) print("{:.2f}\t{:.2f}".format(vx, vy), file=outfile) except ValueError: pass linenum = [x[1] for x in maxima] i = 1 d = {} with open('inputv.dat') as f5, open('output_veloc_max.dat', 'w') as out: out.write('VECTOR\nND {0:2d}\nST 0\nTS 0.00\n'.format(c)) for num, line1 in enumerate(f5, 1): if num in linenum: d[num] = line1 out.writelines([d[i] for i in linenum]) print('Programm ist beendet.') name3 = input("Eingabedatei drucken ")
You don't have any documentation or input / output examples which makes your code really hard to follow.
You also didn't split the logic of your program into functions, which would clearly make this a lot more easy to follow.
- Try to comment out / remove the lines which are there only for debug purposes. This will save a bit of memory.
// I'll continue the review later on as I'm busy right now
Explore related questions
See similar questions with these tags.