
In my opinion, vectorized operations with numpy should be much faster than a pure-Python for loop. I wrote two functions to read and process data from a CSV file, one with numpy and one in pure Python, but the numpy one takes nearly four times as long as the other. Why? Am I using numpy the "wrong" way? Any suggestion would be greatly appreciated!

The Python code is below. The CSV file is rather long, so I uploaded it here: enter link description here

The CSV file contains data about an engine: the first column is the crankshaft angle in degrees, and the 8th column (header "PCYL_1") is the first cylinder's pressure in bar.

What I want to do:

  1. get angle-pressure data pairs with only integer angle,
  2. group the data by angle, and get the max pressure of each angle
  3. get new angle-max_pressure data pairs
  4. shift angle range from -360~359 to 0~719
  5. sort data-pairs by angle
  6. because the angle range must be 0~720 and the first pressure equals the last, append a [720.0, first_pressure] pair to the data pairs
  7. output data pairs to a dat file
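
The grouping and shifting steps (1-6) can be sketched in plain Python; the angle strings and pressure values below are made-up stand-ins for the real CSV rows:

```python
# Minimal sketch of steps 1-6 on synthetic data (the real input comes from
# the CSV file; these values are hypothetical, for illustration only).
from collections import defaultdict

def process(pairs):
    # 1-2: keep integer angles only, group by angle, take the max pressure
    groups = defaultdict(list)
    for angle, pressure in pairs:
        if '.' not in angle:                          # integer angles only
            groups[int(angle)].append(float(pressure))
    data = {a: max(ps) for a, ps in groups.items()}   # 3: angle -> max pressure
    # 4: shift the angle range from -360~359 to 0~719
    data = {(a + 720) % 720: p for a, p in data.items()}
    # 5: sort the pairs by angle
    result = sorted(data.items())
    # 6: close the cycle: the pressure at 720 equals the pressure at 0
    result.append((720, result[0][1]))
    return result

pairs = [('-360', '1.5'), ('-360', '2.0'), ('0', '3.0'),
         ('0.5', '9.9'), ('359', '4.0')]
print(process(pairs))  # -> [(0, 3.0), (359, 4.0), (360, 2.0), (720, 3.0)]
```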

My runtime environment is:

  1. python3.6.4 MSC v.1900 32 bit (Intel)
  2. win8.1 64 bit

I run IPython in the script's directory and enter the following:

from gen_cylinder_pressure_data_from_csv import *
In [5]: %timeit main_pure_python()
153 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit main_with_numpy()
627 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The Python code is below:

from glob import glob
import numpy


def get_data(filename):
    with open(filename, 'r', encoding='utf-8-sig') as f:
        headers = None
        for line_number, line in enumerate(f.readlines()):
            if line_number == 0:
                headers = line.strip().split(',')
                # '曲轴转角' is the crankshaft-angle column header
                angle_index = headers.index('曲轴转角')
                cylinder_pressure_indexes = [i for i in range(len(headers))
                                             if headers[i].startswith('PCYL_1')]
            elif line_number == 1:
                continue  # skip the units row
            else:
                data = line.strip()
                if data != '':
                    datas = data.split(',')
                    angle = datas[angle_index]
                    if '.' not in angle:  # keep only integer angles
                        cylinder_pressure = datas[cylinder_pressure_indexes[0]]
                        yield angle, cylinder_pressure
def write_data(filename):
    data_dic = {}
    for angle, cylinder_pressure in get_data(filename):
        k = int(angle)
        v = float(cylinder_pressure)
        if k in data_dic:
            data_dic[k].append(v)
        else:
            data_dic[k] = [v]
    for k, v in data_dic.items():
        data_dic[k] = max(v)
    angles = sorted(data_dic.keys())
    if angles[-1] - angles[0] != 720:
        data_dic[angles[0] + 720] = data_dic[angles[0]]
        angles.append(angles[0] + 720)
    else:
        print(angles[0], angles[-1])
    with open('%srpm.dat' % filename[-8:-4], 'w', encoding='utf-8') as f:
        for k in angles:
            f.write('%s,%s\n' % (k, data_dic[k]))
def main_with_numpy():
    # slower than main_pure_python
    for filename in glob('Ten*.csv'):
        with open(filename, mode='r', encoding='utf-8-sig') as f:
            data_array = numpy.loadtxt(f, delimiter=',', usecols=(0, 7), skiprows=2)[::10]
        pressure_array = data_array[:, 1]
        pressure_array = pressure_array.reshape(720, pressure_array.shape[0] // 720)
        pressure_array = numpy.amax(pressure_array, axis=1, keepdims=True)
        data_output = numpy.zeros((721, 2))
        data_output[:-1, 0] = data_array[:720, 0]
        data_output[:-1, 1] = pressure_array.reshape(720)
        data_output[:, 0] = (data_output[:, 0] + 720) % 720
        data_output[-1, 0] = 721  # sentinel so the extra row sorts last
        data_output = data_output[data_output[:, 0].argsort()]
        data_output[-1] = data_output[0]
        data_output[-1, 0] = 720.0
        with open('%srpm.dat' % filename[-8:-4], 'w', encoding='utf-8') as f:
            numpy.savetxt(f, data_output, fmt='%f', delimiter=',')
def main_pure_python():
    for filename in glob('Ten*.csv'):
        write_data(filename)


if __name__ == '__main__':
    main_pure_python()
t3chb0t
asked Nov 6, 2018 at 7:27
  • "We need to know what your program is doing. Could you explain these calculations, etc.?" Commented Nov 6, 2018 at 8:38
  • "Thank you for replying; I've edited the question and listed what I want to do. @t3chb0t" Commented Nov 6, 2018 at 9:12
  • "OK, great; it'd be even better if you could also add your measurements, since you're saying that one method is faster than the other... is the difference in seconds or minutes? Oh, and we need the python version tag too." Commented Nov 6, 2018 at 9:15
  • "Measurements and python version added. @t3chb0t" Commented Nov 6, 2018 at 13:09
  • "loadtxt and savetxt don't make much use of compiled numpy. They use Python file I/O. Their performance and code have been discussed on SO many times." Commented Nov 7, 2018 at 3:19

1 Answer


Well, I am also a newbie with numpy, but your question interested me, so I profiled your code and did some googling about numpy. Here is what I found.

The main reason your numpy solution is so slow is numpy.loadtxt.
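
Profiles like the ones below can be captured with the standard-library cProfile module; a minimal sketch (work() here is just a hypothetical stand-in for main_with_numpy or main_pure_python):

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the function under test; any callable can be profiled
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# render the stats sorted by cumulative time, like the tables below
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(10)
report = buf.getvalue()
print(report)
```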


Profiler Result

Here is the profiler result for your main_with_numpy function:

1562753 function calls (1476352 primitive calls) in 1.624 seconds
 Ordered by: cumulative time
 ncalls tottime percall cumtime percall filename:lineno(function)
 1 0.004 0.004 1.624 1.624 gen_cylinder_pressure_data_from_csv.py:55(main_with_numpy)
 1 0.032 0.032 1.609 1.609 npyio.py:765(loadtxt)
 3 0.430 0.143 1.545 0.515 npyio.py:994(read_data)
 86401 0.144 0.000 0.452 0.000 npyio.py:982(split_line)
 86400 0.086 0.000 0.316 0.000 npyio.py:1019(<listcomp>)
172800/86400 0.228 0.000 0.243 0.000 npyio.py:966(pack_items)
...

And the result for your main_pure_python function:

195793 function calls (195785 primitive calls) in 0.241 seconds
 Ordered by: cumulative time
 ncalls tottime percall cumtime percall filename:lineno(function)
 1 0.000 0.000 0.241 0.241 gen_cylinder_pressure_data_from_csv.py:76(main_pure_python)
 1 0.015 0.015 0.240 0.240 gen_cylinder_pressure_data_from_csv.py:31(write_data)
 8641 0.078 0.000 0.224 0.000 gen_cylinder_pressure_data_from_csv.py:7(get_data)
 86401 0.082 0.000 0.082 0.000 {method 'split' of 'str' objects}
 1 0.042 0.042 0.050 0.050 {method 'readlines' of '_io._IOBase' objects}

Almost 8 times slower, and note that npyio.py:765(loadtxt) accounts for most of the time.

You used a generator in main_pure_python to read the data, so to eliminate the effect of loadtxt I also profiled only the part of each function that runs after the data is loaded.

Here are the results.

With numpy:

2917 function calls in 0.008 seconds
 Ordered by: cumulative time
 ncalls tottime percall cumtime percall filename:lineno(function)
 1 0.000 0.000 0.008 0.008 gen_cylinder_pressure_data_from_csv.py:81(deal_data_numpy)
 1 0.004 0.004 0.006 0.006 npyio.py:1143(savetxt)

Without numpy:

9369 function calls in 0.011 seconds
 Ordered by: cumulative time
 ncalls tottime percall cumtime percall filename:lineno(function)
 1 0.009 0.009 0.011 0.011 gen_cylinder_pressure_data_from_csv.py:44(deal_data)
 7921 0.001 0.000 0.001 0.000 {method 'append' of 'list' objects}
 720 0.000 0.000 0.000 0.000 {built-in method builtins.max}

The post-load part is slightly faster with numpy.


Why numpy.loadtxt is slow

Sorry, I can't review your numpy code in detail, but I googled why numpy.loadtxt is so slow.

Here is the original link

Seriously, stop using the numpy.loadtxt() function (unless you have a lot of spare time...). Why you might ask? - Because it is SLOW! - How slow you might ask? - Very slow! Numpy loads a 250 mb csv-file containing 6215000 x 4 datapoints from my SSD in approx. 35 s!

Other related links about this problem:

So, as mentioned in those links, pandas might be a better choice for reading the CSV file; otherwise, leave it to pure Python.
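
A sketch of swapping numpy.loadtxt for pandas.read_csv (this assumes pandas is installed; the in-memory CSV below, with made-up header names, stands in for the real 'Ten*.csv' file):

```python
import io
import pandas as pd

# Hypothetical miniature of the question's CSV: a header row, a units row,
# then data rows; only columns 0 and 7 matter here.
csv_text = (
    "angle,a,b,c,d,e,f,PCYL_1\n"   # header row (column names are made up)
    "deg,,,,,,,bar\n"              # units row, which the question skips
    "-360,0,0,0,0,0,0,1.5\n"
    "-359,0,0,0,0,0,0,2.0\n"
)

# usecols/skiprows mirror loadtxt(usecols=(0, 7), skiprows=2) in the question:
# keep columns 0 and 7, skip the units row (file line index 1).
df = pd.read_csv(io.StringIO(csv_text), usecols=[0, 7], skiprows=[1], header=0)
data_array = df.to_numpy(dtype=float)
print(data_array)  # two rows, two columns: angle and PCYL_1
```

pandas parses CSV in compiled C code, so on large files this tends to be much faster than loadtxt's line-by-line Python loop, while still handing back a numpy array.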

answered Nov 6, 2018 at 11:47
  • "Thank you very much! I'd never thought of file I/O as the key! I used the profiler in IPython and got similar results: both main_pure_python and main_with_numpy spend most of their time getting the data, with numpy.loadtxt even worse." Commented Nov 6, 2018 at 13:27
