I expected vectorized operations with numpy to be much faster than a plain for loop in pure Python. I wrote two functions that read and process data from a CSV file, one using numpy and the other in pure Python, but the numpy one takes nearly four times as long as the other. Why? Am I using numpy the "wrong" way? Any suggestion would be greatly appreciated!
The Python code is below; the CSV file is rather long, so I uploaded it here: enter link description here
The CSV file contains data about an engine: the first column is the crankshaft angle in degrees, and the 8th column (header "PCYL_1") is the pressure of the first cylinder in bar.
What I want to do:
- get angle-pressure data pairs for integer angles only,
- group the data by angle and take the max pressure for each angle,
- build new angle-max_pressure data pairs,
- shift the angle range from -360~359 to 0~719,
- sort the data pairs by angle,
- because the angle range must be 0~720 and the first pressure equals the last pressure, append a [720.0, first pressure] pair to the data,
- output the data pairs to a .dat file.
My environment is:
- Python 3.6.4 MSC v.1900 32 bit (Intel)
- Windows 8.1 64 bit
I ran IPython in the script's directory and entered the following:
from gen_cylinder_pressure_data_from_csv import *
In [5]: %timeit main_pure_python()
153 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit main_with_numpy()
627 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The Python code is below:
    from glob import glob
    import numpy


    def get_data(filename):
        with open(filename, 'r', encoding='utf-8-sig') as f:
            headers = None
            for line_number, line in enumerate(f.readlines()):
                if line_number == 0:
                    headers = line.strip().split(',')
                    angle_index = headers.index('曲轴转角')  # the "crankshaft angle" column
                    cylinder_pressure_indexes = [i for i in range(len(headers)) if headers[i].startswith('PCYL_1')]
                elif line_number == 1:
                    continue
                else:
                    data = line.strip()
                    if data != '':
                        datas = data.split(',')
                        angle = datas[angle_index]
                        if '.' not in angle:
                            # cylinder_pressure = max(datas[i] for i in cylinder_pressure_indexes)
                            cylinder_pressure = datas[cylinder_pressure_indexes[0]]
                            # if angle == '17':
                            #     print(angle, cylinder_pressure)
                            yield angle, cylinder_pressure


    def write_data(filename):
        data_dic = {}
        for angle, cylinder_pressure in get_data(filename):
            k = int(angle)
            v = float(cylinder_pressure)
            if k in data_dic:
                data_dic[k].append(v)
            else:
                data_dic[k] = [v]
        for k, v in data_dic.items():
            # data_dic[k] = sum(v) / len(v)
            data_dic[k] = max(v)
        angles = sorted(data_dic.keys())
        if angles[-1] - angles[0] != 720:
            data_dic[angles[0] + 720] = data_dic[angles[0]]
            angles.append(angles[0] + 720)
        else:
            print(angles[0], angles[-1])
        with open('%srpm.dat' % filename[-8:-4], 'w', encoding='utf-8') as f:
            for k in angles:
                f.write('%s,%s\n' % (k, data_dic[k]))


    def main_with_numpy():
        # rather slow compared to main_pure_python
        for filename in glob('Ten*.csv'):
            with open(filename, mode='r', encoding='utf-8-sig') as f:
                data_array = numpy.loadtxt(f, delimiter=',', usecols=(0, 7), skiprows=2)[::10]
            pressure_array = data_array[:, 1]
            pressure_array = pressure_array.reshape(720, pressure_array.shape[0] // 720)
            pressure_array = numpy.amax(pressure_array, axis=1, keepdims=True)
            data_output = numpy.zeros((721, 2))
            data_output[:-1, 0] = data_array[:720, 0]
            data_output[:-1, 1] = pressure_array.reshape(720)
            data_output[:, 0] = (data_output[:, 0] + 720) % 720
            data_output[-1, 0] = 721
            data_output = data_output[data_output[:, 0].argsort()]
            data_output[-1] = data_output[0]
            data_output[-1, 0] = 720.0
            with open('%srpm.dat' % filename[-8:-4], 'w', encoding='utf-8') as f:
                numpy.savetxt(f, data_output, fmt='%f', delimiter=',')


    def main_pure_python():
        for filename in glob('Ten*.csv'):
            write_data(filename)


    if __name__ == '__main__':
        main_pure_python()
1 Answer
Well, I am also a newbie with numpy, but your question interested me, so I profiled your code and googled a bit about numpy. Here is what I found.
The main reason your numpy solution is so slow is numpy.loadtxt.
Profiler Result
Here is the profiler result for your main_with_numpy function:
1562753 function calls (1476352 primitive calls) in 1.624 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 1.624 1.624 gen_cylinder_pressure_data_from_csv.py:55(main_with_numpy)
1 0.032 0.032 1.609 1.609 npyio.py:765(loadtxt)
3 0.430 0.143 1.545 0.515 npyio.py:994(read_data)
86401 0.144 0.000 0.452 0.000 npyio.py:982(split_line)
86400 0.086 0.000 0.316 0.000 npyio.py:1019(<listcomp>)
172800/86400 0.228 0.000 0.243 0.000 npyio.py:966(pack_items)
...
And the result from your main_pure_python function:
195793 function calls (195785 primitive calls) in 0.241 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.241 0.241 gen_cylinder_pressure_data_from_csv.py:76(main_pure_python)
1 0.015 0.015 0.240 0.240 gen_cylinder_pressure_data_from_csv.py:31(write_data)
8641 0.078 0.000 0.224 0.000 gen_cylinder_pressure_data_from_csv.py:7(get_data)
86401 0.082 0.000 0.082 0.000 {method 'split' of 'str' objects}
1 0.042 0.042 0.050 0.050 {method 'readlines' of '_io._IOBase' objects}
Almost 8 times slower, and note that npyio.py:765(loadtxt) costs most of the time.
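(For reference, listings like these can be reproduced with the standard-library cProfile module. A minimal sketch, assuming the functions can be imported from the script and this snippet is run as the main program:)

    import cProfile

    from gen_cylinder_pressure_data_from_csv import main_pure_python, main_with_numpy

    # Run each version once under the profiler, sorted by cumulative time
    # as in the listings above.
    cProfile.run('main_with_numpy()', sort='cumulative')
    cProfile.run('main_pure_python()', sort='cumulative')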
You used a generator in main_pure_python to read the data, so to eliminate the effect of loadtxt I also profiled only the part of each function that runs after the data is loaded.
Here are the results.
With numpy
2917 function calls in 0.008 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.008 0.008 gen_cylinder_pressure_data_from_csv.py:81(deal_data_numpy)
1 0.004 0.004 0.006 0.006 npyio.py:1143(savetxt)
Without numpy
9369 function calls in 0.011 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.009 0.009 0.011 0.011 gen_cylinder_pressure_data_from_csv.py:44(deal_data)
7921 0.001 0.000 0.001 0.000 {method 'append' of 'list' objects}
720 0.000 0.000 0.000 0.000 {built-in method builtins.max}
With numpy it is slightly faster.
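(The deal_data_numpy and deal_data names in these listings are not in the question's code; they are presumably the two functions with the file reading factored out, so that only the post-load work is measured. Roughly like this sketch, where data_array is the already-loaded loadtxt result and out_name is a hypothetical output path:)

    import numpy

    def deal_data_numpy(data_array, out_name):
        # Post-load half of main_with_numpy: take the max pressure per
        # angle, shift and sort the angles, and close the 0~720 range.
        pressure = numpy.amax(data_array[:, 1].reshape(720, -1), axis=1)
        data_output = numpy.zeros((721, 2))
        data_output[:-1, 0] = (data_array[:720, 0] + 720) % 720
        data_output[:-1, 1] = pressure
        data_output[-1, 0] = 721  # sentinel so this row sorts last
        data_output = data_output[data_output[:, 0].argsort()]
        data_output[-1] = data_output[0]
        data_output[-1, 0] = 720.0
        with open(out_name, 'w', encoding='utf-8') as f:
            numpy.savetxt(f, data_output, fmt='%f', delimiter=',')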
Why numpy.loadtxt is slow
Sorry, I can't help you review your numpy code in detail, but I googled why numpy.loadtxt is so slow:
Seriously, stop using the numpy.loadtxt() function (unless you have a lot of spare time...). Why you might ask? - Because it is SLOW! - How slow you might ask? - Very slow! Numpy loads a 250 mb csv-file containing 6215000 x 4 datapoints from my SSD in approx. 35 s!
There are other related links about this problem.
So, as mentioned in those links, pandas might be a better choice for reading the CSV file, or you can leave it to plain pure Python.
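For example, here is a minimal sketch of the loading step rewritten with pandas.read_csv (the column positions match the question's usecols=(0, 7); skiprows=[1] drops the units row while keeping the header, and everything after the load can stay the same):

    import pandas

    def load_with_pandas(filename):
        # pandas.read_csv uses a fast C parser, unlike numpy.loadtxt,
        # which parses every line in pure Python.
        frame = pandas.read_csv(filename, usecols=[0, 7], skiprows=[1],
                                encoding='utf-8-sig')
        return frame.to_numpy()[::10]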
Comment: Thank you very much! I've never thought of file I/O as the key! I used the profiler in IPython and got similar results: both main_pure_python and main_with_numpy spend most of their time getting the data, with numpy.loadtxt being even worse. – user2458587, Nov 6, 2018
Comment: loadtxt and savetxt don't make much use of compiled numpy; they use Python file I/O. Their performance and code have been discussed on SO many times.
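(In other words, even a hand-rolled reader in plain Python can beat loadtxt. A minimal sketch for this file layout, assuming every data row is numeric and comma-separated:)

    import numpy

    def load_manually(filename):
        # Split each line with plain Python and convert to an array once,
        # avoiding loadtxt's per-line packing overhead.
        with open(filename, encoding='utf-8-sig') as f:
            rows = [line.split(',') for line in list(f)[2:] if line.strip()]
        return numpy.array([(r[0], r[7]) for r in rows], dtype=float)[::10]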