Matching maximal nodes with velocities

Question 1

I have written a code which works very good for small data-files (see below). The data file 'WSPL.dat' which I want to evaluate has a size of 2.6 GB and the data file 'VELOC.dat' has a size of 4.3 GB. The production of the data 'inputv.dat' takes almost 45 minutes and it has a size of 2.6 GB. At the final step, the code should produce a 'output_veloc_max.dat'. After 30 hours of calculations the last output could not be produced. Nothing was written in the final data. I can imagine that the opening of such a big data and writing a new data which grows continusly can lead to this problem.

My questions:

Can I optimize this code in such a way that it works faster?
If the code cannot be further optimized, should I rewrite the code in multiprocessing code?
If rewriting for multi processing helps, how should I do it?

from __future__ import print_function
import time
import re
c=[]
def get_num(x):
 return int(''.join(ele for ele in x if ele.isdigit()))
with open('WSPL.dat', 'r') as f:
 for line in f:
 if "ND" in line:
 print(line)
 c=get_num(line)
 print(c)
 break
linenum = 0
print("Anzahl der Knoten: ",c)
print('Program started.\n') 
maxima = [[float('-inf'), ''] for _ in range(c)]
with open('WSPL.dat') as f:
 for line in f:
 if line.startswith('TS'):
 for maximum, line in zip(maxima, f):
 linenum += 1
 value = float(line)
 if value > maximum[0]:
 maximum[:] = value, linenum,line
with open('VELOC.dat', 'r') as f, open('inputv.dat', 'w') as outfile:
 for line in f:
 try:
 line = line.strip()
 columns = line.split()
 vx = float(columns[0])
 vy = float(columns[1])
 print("{:.2f}\t{:.2f}".format(vx, vy), file=outfile)
 except ValueError:
 pass
linenum=[x[1] for x in maxima]
i=1
d = {} 
with open('inputv.dat', 'r') as f5, open('output_veloc_max.dat', 'w') as out:
 out.write('VECTOR\nND {0:2d}\nST 0\nTS 0.00\n'.format(c))
 for num, line1 in enumerate(f5, 1):
 if num in linenum:
 d[num] = line1
 out.writelines([d[i] for i in linenum])
print('Programm ist beendet.')
name3 = input("Eingabedatei drucken ")

A very small example for 'WSPL.dat' can be like this:

SCALAR
ND 3
ST 0
TS 10.00
 666.0000
 0.0000
 0.0000
SCALAR
ND 3
ST 0
TS 3600.47
 355.1744
 255.0201
 255.2748
SCALAR
ND 3
ST 0
TS 7200.42
 555.5984
 255.4946
 355.7014

A very small example for 'VELOC.dat' can be like this:

VECTOR
ND 3
ST 0
TS 10.00
 0.000 0.000
 0.000 0.000
 0.000 0.000
VECTOR
ND 3
ST 0
TS 3600.47
 -0.280 1.385
 -0.494 1.239
 -0.574 1.647
VECTOR
ND 3
ST 0
TS 7200.42
 -0.609 2.114
 -0.900 1.734
 -0.871 2.293

The lines under the TS in VLOC.dat and WSPL.dat belong to a node. for this small example we got 3 nodes (ND=3). The program finds the maximum value for each node in WSPL.dat and writes the line numbers in inputv.dat. Then opens the inputv.dat and VLOC.dat and finds the values for the given line numbers and write in another file 'output_veloc_max.dat'.

Question 2

Welcome to codereview! Could you please give us some input/output examples ? More, have in mind that your last question is barely on-topic for this sites' purposes. Here, we are supposed to only review the existing code while what you're asking is basically "write this for me in a different way"

Question 3

Thank you for welcoming me. I have added a very small example. I would appreciate and learn a lot if some experienced programer can write this in a different way. If you think it is not proper, so please give me just a hint if rewriting the code with multiprocessing (threading) help to save time? Which part should I put in different threads?

Question 4

Performance

Your problem report baffles me. You say that you can successfully produce inputv.dat, a 2.6 GB file, in 45 minutes. Yet, filtering inputv.dat to produce output_veloc_max.dat fails after 30 hours? The only possible bottleneck is if num in linenum — and that would only be inefficient if the number of nodes is huge. To make it work efficiently for a large number of nodes, you could use a set instead of a list:

linenum = set(x[1] for x in maxima)

Multiprocessing would not help you much, since the task is not very parallelizable. It is possible to analyze WSPL.dat in parallel with the VELOC.dat→inputv.dat conversion, but producing output_veloc_max.dat must wait. Since these data files are line-oriented, you cannot easily skip to arbitrary points within them, so you must process VELOC.dat sequentially. The only significant optimization would be to avoid writing out inputv.dat as an intermediate result and reading it back in.

Style

Drop imports that you not using, like import time and import re. Since you have tagged this question as python-3.x, you also don't need from __future__ import print_function.

The initialization of c as c=[] is garbage. The real assignment is c=get_num(line). The get_num function is weird, in that it gathers up all digits that appear on a line, returning 24 even if the text is "2 cool 4 me".

I'm not a fan of your maxima two-dimensional array. By storing pairs in a list, you end up with code that is obfuscated by subscripts, like if value > maximum[0] and [x[1] for x in maxima]. Ideally, lists should be used for homogeneous data of variable length. This pair would be better written as a tuple or a namedtuple. But since you don't care about the values as soon as you have obtained the line numbers, you might as well make two separate variables max_samples and max_indices.

The way you parse VELOC.dat is odd. You indiscriminately attempt to interpret the first column of every line as a float, then ignore the line if that fails. I would guess, looking at your excerpt of VELOC.dat, that a better approach would be to look for a line containing "TS", then take the following c lines, where c is the number of nodes — the same strategy that you use when parsing WSPL.dat.

This program is large enough that it should be split into functions, so that each chunk of code has a name that reflects its purpose, and so that each function has its own local variables.

from itertools import count
def n_lines_after_ts(n, f):
 """
 Filter the file, extracting the next n lines after each occurrence of "TS".
 Each result is a triple consisting of:
 * A sequential sample number (counting up from 0)
 * The node number for that sample (cycling 0, 1, ..., n-1)
 * The line from the file
 """
 line_counter = count()
 for line in f:
 if line.startswith('TS'):
 for node in range(n):
 yield next(line_counter), node, next(f)
def num_nodes(wspl_dat):
 """
 Parse the file to extract the first number that appears after "ND".
 """
 for line in wspl_dat:
 if line.startswith('ND'):
 return int(line.partition('ND')[-1])
def wspl_max_indices(num_nodes, wspl_dat):
 """
 Find the indices of the maximum sample for each node.
 """
 max_samples = [float('-inf')] * num_nodes
 max_indices = [None] * num_nodes
 for line_num, node, sample in n_lines_after_ts(num_nodes, wspl_dat):
 sample = float(sample)
 if sample > max_samples[node]:
 max_samples[node] = sample
 max_indices[node] = line_num
 return max_indices
def extract_veloc(num_nodes, veloc_dat, indices):
 """
 Extract the specified indices from the velocity data.
 """
 set_of_indices = set(indices)
 lines = {
 line_num: sample
 for line_num, _, sample in n_lines_after_ts(num_nodes, veloc_dat)
 if line_num in set_of_indices
 }
 return (lines[i] for i in indices)
with open('WSPL.dat') as wspl_dat_file, \
 open('VELOC.dat') as veloc_dat_file, \
 open('output_veloc_max.dat', 'w') as out:
 num_nodes = num_nodes(wspl_dat_file)
 print('Anzahl der Knoten: {0}'.format(num_nodes))
 max_wspl_indices = wspl_max_indices(num_nodes, wspl_dat_file)
 out.write('VECTOR\nND {0:2d}\nST 0\nTS 0.00\n'.format(num_nodes))
 out.writelines(extract_veloc(num_nodes, veloc_dat_file, max_wspl_indices))
print('Programm ist beendet.')

Question 5

WOW! I really enjoyed your analyse! I have learned a lot. I have understood but I need more time to internalise it.

Question 6

I have a small question. I have run the code which you have written for bigger data. It works perfectly, but at the end the velocieties will be sorted and written in the final output. In function "extract_veloc", the indices will be sorted and extracted from VELOC.dat and written in output. But what I need is the velocities respected to the indicies (without sorting). I tryed to take the following line out "indices = set(indices)", but it did not wrok.

Question 7

My solution doesn't sort anything. extract_veloc() is just a filter. The output will appear in the same order of appearance as the input. indices = set(indices) is just an optimization to use the right data structure so that the in operator is fast.

Question 8

Thanks for your reply. For example if the indices of the maxima is [15, 16, 17, 18, 5, 8, 9] the maximum velocities will be appear in this order [5, 8, 9, 15, 16, 17, 18]. I think it is because of the if statement, isn't it? If searches in the list and finds the first integer. Am I correct?

Question 9

I see. I think I had misunderstood the problem. Corrected in Rev 5.

Question 10

First, some PEP8 style issues and general coding style advices:

You shouldn't import modules that you're not using (import time and import re)

You should have two newlines before declaring a new function:

This:

c=[]
def get_num(x):
 return int(''.join(ele for ele in x if ele.isdigit()))

Would become:

c=[]
def get_num(x):
 return int(''.join(ele for ele in x if ele.isdigit()))

You should indent your code by 4-spaces in a new block, not 8 as you did in your first if block.
Be consistent when writing code !

Sometimes you use with open() to write to a file, and sometimes you do this: print("{:.2f}\t{:.2f}".format(vx, vy), file=outfile). Choose one of them. I'd choose the first one.

More, you can directly use with open(filename) when reading from a file. This will automatically open the file in read mode.

Around = operator, you should have a space. Also, after , you should have a space.

Having said this, your code will look like this so far:

from __future__ import print_function
def get_num(x):
 return int(''.join(ele for ele in x if ele.isdigit()))
c = []
with open('WSPL.dat') as f:
 for line in f:
 if "ND" in line:
 print(line)
 c = get_num(line)
 print(c)
 break
linenum = 0
print("Anzahl der Knoten: ", c)
print('Program started.\n')
maxima = [[float('-inf'), ''] for _ in range(c)]
with open('WSPL.dat') as f:
 for line in f:
 if line.startswith('TS'):
 for maximum, line in zip(maxima, f):
 linenum += 1
 value = float(line)
 if value > maximum[0]:
 maximum[:] = value, linenum, line
with open('VELOC.dat') as f, open('inputv.dat', 'w') as outfile:
 for line in f:
 try:
 line = line.strip()
 columns = line.split()
 vx = float(columns[0])
 vy = float(columns[1])
 print("{:.2f}\t{:.2f}".format(vx, vy), file=outfile)
 except ValueError:
 pass
linenum = [x[1] for x in maxima]
i = 1
d = {}
with open('inputv.dat') as f5, open('output_veloc_max.dat', 'w') as out:
 out.write('VECTOR\nND {0:2d}\nST 0\nTS 0.00\n'.format(c))
 for num, line1 in enumerate(f5, 1):
 if num in linenum:
 d[num] = line1
 out.writelines([d[i] for i in linenum])
print('Programm ist beendet.')
name3 = input("Eingabedatei drucken ")

You don't have any documentation or input / output examples which makes your code really hard to follow.

You also didn't split the logic of your program into functions, which would clearly make this a lot more easy to follow.
Try to comment out / remove the lines which are there only for debug purposes. This will save a bit of memory.

// I'll continue the review later on as I'm busy right now

200_success 200_success 146k22 gold badges190 silver badges479 bronze badges · Accepted Answer · 2016-11-01 21:43:22Z

Performance

Your problem report baffles me. You say that you can successfully produce inputv.dat, a 2.6 GB file, in 45 minutes. Yet, filtering inputv.dat to produce output_veloc_max.dat fails after 30 hours? The only possible bottleneck is if num in linenum — and that would only be inefficient if the number of nodes is huge. To make it work efficiently for a large number of nodes, you could use a set instead of a list:

linenum = set(x[1] for x in maxima)

Multiprocessing would not help you much, since the task is not very parallelizable. It is possible to analyze WSPL.dat in parallel with the VELOC.dat→inputv.dat conversion, but producing output_veloc_max.dat must wait. Since these data files are line-oriented, you cannot easily skip to arbitrary points within them, so you must process VELOC.dat sequentially. The only significant optimization would be to avoid writing out inputv.dat as an intermediate result and reading it back in.

Style

Drop imports that you not using, like import time and import re. Since you have tagged this question as python-3.x, you also don't need from __future__ import print_function.

The initialization of c as c=[] is garbage. The real assignment is c=get_num(line). The get_num function is weird, in that it gathers up all digits that appear on a line, returning 24 even if the text is "2 cool 4 me".

I'm not a fan of your maxima two-dimensional array. By storing pairs in a list, you end up with code that is obfuscated by subscripts, like if value > maximum[0] and [x[1] for x in maxima]. Ideally, lists should be used for homogeneous data of variable length. This pair would be better written as a tuple or a namedtuple. But since you don't care about the values as soon as you have obtained the line numbers, you might as well make two separate variables max_samples and max_indices.

The way you parse VELOC.dat is odd. You indiscriminately attempt to interpret the first column of every line as a float, then ignore the line if that fails. I would guess, looking at your excerpt of VELOC.dat, that a better approach would be to look for a line containing "TS", then take the following c lines, where c is the number of nodes — the same strategy that you use when parsing WSPL.dat.

This program is large enough that it should be split into functions, so that each chunk of code has a name that reflects its purpose, and so that each function has its own local variables.

from itertools import count
def n_lines_after_ts(n, f):
 """
 Filter the file, extracting the next n lines after each occurrence of "TS".
 Each result is a triple consisting of:
 * A sequential sample number (counting up from 0)
 * The node number for that sample (cycling 0, 1, ..., n-1)
 * The line from the file
 """
 line_counter = count()
 for line in f:
 if line.startswith('TS'):
 for node in range(n):
 yield next(line_counter), node, next(f)
def num_nodes(wspl_dat):
 """
 Parse the file to extract the first number that appears after "ND".
 """
 for line in wspl_dat:
 if line.startswith('ND'):
 return int(line.partition('ND')[-1])
def wspl_max_indices(num_nodes, wspl_dat):
 """
 Find the indices of the maximum sample for each node.
 """
 max_samples = [float('-inf')] * num_nodes
 max_indices = [None] * num_nodes
 for line_num, node, sample in n_lines_after_ts(num_nodes, wspl_dat):
 sample = float(sample)
 if sample > max_samples[node]:
 max_samples[node] = sample
 max_indices[node] = line_num
 return max_indices
def extract_veloc(num_nodes, veloc_dat, indices):
 """
 Extract the specified indices from the velocity data.
 """
 set_of_indices = set(indices)
 lines = {
 line_num: sample
 for line_num, _, sample in n_lines_after_ts(num_nodes, veloc_dat)
 if line_num in set_of_indices
 }
 return (lines[i] for i in indices)
with open('WSPL.dat') as wspl_dat_file, \
 open('VELOC.dat') as veloc_dat_file, \
 open('output_veloc_max.dat', 'w') as out:
 num_nodes = num_nodes(wspl_dat_file)
 print('Anzahl der Knoten: {0}'.format(num_nodes))
 max_wspl_indices = wspl_max_indices(num_nodes, wspl_dat_file)
 out.write('VECTOR\nND {0:2d}\nST 0\nTS 0.00\n'.format(num_nodes))
 out.writelines(extract_veloc(num_nodes, veloc_dat_file, max_wspl_indices))
print('Programm ist beendet.')

WOW! I really enjoyed your analyse! I have learned a lot. I have understood but I need more time to internalise it.
I have a small question. I have run the code which you have written for bigger data. It works perfectly, but at the end the velocieties will be sorted and written in the final output. In function "extract_veloc", the indices will be sorted and extracted from VELOC.dat and written in output. But what I need is the velocities respected to the indicies (without sorting). I tryed to take the following line out "indices = set(indices)", but it did not wrok.
My solution doesn't sort anything. extract_veloc() is just a filter. The output will appear in the same order of appearance as the input. indices = set(indices) is just an optimization to use the right data structure so that the in operator is fast.
Thanks for your reply. For example if the indices of the maxima is [15, 16, 17, 18, 5, 8, 9] the maximum velocities will be appear in this order [5, 8, 9, 15, 16, 17, 18]. I think it is because of the if statement, isn't it? If searches in the list and finds the first integer. Am I correct?
I see. I think I had misunderstood the problem. Corrected in Rev 5.

Stack Exchange Network

Matching maximal nodes with velocities

2 Answers 2

Performance

Style

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Matching maximal nodes with velocities

2 Answers 2

Performance

Style

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions