I expected vectorized operations with numpy to be much faster than a plain for loop in pure Python. I wrote two functions that read and process data from a CSV file, one using numpy and the other in pure Python, but the numpy one takes nearly four times as long as the other. Why? Am I using numpy the "wrong" way? Any suggestion would be greatly appreciated!
The Python code is below; the CSV file is rather long, so I uploaded it here: enter link description here
The CSV file contains data about an engine: the first column is the crankshaft angle in degrees, and the 8th column (header "PCYL_1") is the pressure of the first cylinder in bar.
What I want to do:
- get angle-pressure data pairs for integer angles only,
- group the data by angle and take the max pressure for each angle,
- build new angle-max_pressure data pairs,
- shift the angle range from -360~359 to 0~719,
- sort the data pairs by angle,
- because the angle range must be 0~720 and the first pressure equals the last pressure, append a [720.0, first pressure] pair to the data,
- output the data pairs to a .dat file.
My environment is:
- Python 3.6.4 MSC v.1900 32 bit (Intel)
- Windows 8.1 64 bit
I ran IPython in the script's directory and entered the following:
from gen_cylinder_pressure_data_from_csv import *
In [5]: %timeit main_pure_python()
153 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit main_with_numpy()
627 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The Python code is below:
    from glob import glob
    import numpy


    def get_data(filename):
        with open(filename, 'r', encoding='utf-8-sig') as f:
            headers = None
            for line_number, line in enumerate(f.readlines()):
                if line_number == 0:
                    headers = line.strip().split(',')
                    angle_index = headers.index('曲轴转角')  # the "crankshaft angle" column
                    cylinder_pressure_indexes = [i for i in range(len(headers)) if headers[i].startswith('PCYL_1')]
                elif line_number == 1:
                    continue
                else:
                    data = line.strip()
                    if data != '':
                        datas = data.split(',')
                        angle = datas[angle_index]
                        if '.' not in angle:
                            # cylinder_pressure = max(datas[i] for i in cylinder_pressure_indexes)
                            cylinder_pressure = datas[cylinder_pressure_indexes[0]]
                            # if angle == '17':
                            #     print(angle, cylinder_pressure)
                            yield angle, cylinder_pressure


    def write_data(filename):
        data_dic = {}
        for angle, cylinder_pressure in get_data(filename):
            k = int(angle)
            v = float(cylinder_pressure)
            if k in data_dic:
                data_dic[k].append(v)
            else:
                data_dic[k] = [v]
        for k, v in data_dic.items():
            # data_dic[k] = sum(v) / len(v)
            data_dic[k] = max(v)
        angles = sorted(data_dic.keys())
        if angles[-1] - angles[0] != 720:
            data_dic[angles[0] + 720] = data_dic[angles[0]]
            angles.append(angles[0] + 720)
        else:
            print(angles[0], angles[-1])
        with open('%srpm.dat' % filename[-8:-4], 'w', encoding='utf-8') as f:
            for k in angles:
                f.write('%s,%s\n' % (k, data_dic[k]))


    def main_with_numpy():
        # rather slow compared to main_pure_python
        for filename in glob('Ten*.csv'):
            with open(filename, mode='r', encoding='utf-8-sig') as f:
                data_array = numpy.loadtxt(f, delimiter=',', usecols=(0, 7), skiprows=2)[::10]
            pressure_array = data_array[:, 1]
            pressure_array = pressure_array.reshape(720, pressure_array.shape[0] // 720)
            pressure_array = numpy.amax(pressure_array, axis=1, keepdims=True)
            data_output = numpy.zeros((721, 2))
            data_output[:-1, 0] = data_array[:720, 0]
            data_output[:-1, 1] = pressure_array.reshape(720)
            data_output[:, 0] = (data_output[:, 0] + 720) % 720
            data_output[-1, 0] = 721
            data_output = data_output[data_output[:, 0].argsort()]
            data_output[-1] = data_output[0]
            data_output[-1, 0] = 720.0
            with open('%srpm.dat' % filename[-8:-4], 'w', encoding='utf-8') as f:
                numpy.savetxt(f, data_output, fmt='%f', delimiter=',')


    def main_pure_python():
        for filename in glob('Ten*.csv'):
            write_data(filename)


    if __name__ == '__main__':
        main_pure_python()
1 Answer
Well, I am also a newbie with numpy, but your question interested me, so I profiled your code and googled a bit about numpy. Here is what I found.
The main reason your numpy solution is so slow is numpy.loadtxt.
Profiler Result
Here is the profiler result for your main_with_numpy function:
1562753 function calls (1476352 primitive calls) in 1.624 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 1.624 1.624 gen_cylinder_pressure_data_from_csv.py:55(main_with_numpy)
1 0.032 0.032 1.609 1.609 npyio.py:765(loadtxt)
3 0.430 0.143 1.545 0.515 npyio.py:994(read_data)
86401 0.144 0.000 0.452 0.000 npyio.py:982(split_line)
86400 0.086 0.000 0.316 0.000 npyio.py:1019(<listcomp>)
172800/86400 0.228 0.000 0.243 0.000 npyio.py:966(pack_items)
...
And the result from your main_pure_python function:
195793 function calls (195785 primitive calls) in 0.241 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.241 0.241 gen_cylinder_pressure_data_from_csv.py:76(main_pure_python)
1 0.015 0.015 0.240 0.240 gen_cylinder_pressure_data_from_csv.py:31(write_data)
8641 0.078 0.000 0.224 0.000 gen_cylinder_pressure_data_from_csv.py:7(get_data)
86401 0.082 0.000 0.082 0.000 {method 'split' of 'str' objects}
1 0.042 0.042 0.050 0.050 {method 'readlines' of '_io._IOBase' objects}
Almost 8 times slower, and note that npyio.py:765(loadtxt) costs most of the time.
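(For reference, listings like these can be reproduced with the standard-library cProfile module. A minimal sketch, assuming the functions can be imported from the script and this snippet is run as the main program:)

    import cProfile

    from gen_cylinder_pressure_data_from_csv import main_pure_python, main_with_numpy

    # Run each version once under the profiler, sorted by cumulative time
    # as in the listings above.
    cProfile.run('main_with_numpy()', sort='cumulative')
    cProfile.run('main_pure_python()', sort='cumulative')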
You used a generator in main_pure_python to read the data, so to eliminate the effect of loadtxt I also profiled only the part of each function that runs after the data is loaded.
Here are the results.
With numpy
2917 function calls in 0.008 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.008 0.008 gen_cylinder_pressure_data_from_csv.py:81(deal_data_numpy)
1 0.004 0.004 0.006 0.006 npyio.py:1143(savetxt)
Without numpy
9369 function calls in 0.011 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.009 0.009 0.011 0.011 gen_cylinder_pressure_data_from_csv.py:44(deal_data)
7921 0.001 0.000 0.001 0.000 {method 'append' of 'list' objects}
720 0.000 0.000 0.000 0.000 {built-in method builtins.max}
With numpy it is slightly faster.
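(The deal_data_numpy and deal_data names in these listings are not in the question's code; they are presumably the two functions with the file reading factored out, so that only the post-load work is measured. Roughly like this sketch, where data_array is the already-loaded loadtxt result and out_name is a hypothetical output path:)

    import numpy

    def deal_data_numpy(data_array, out_name):
        # Post-load half of main_with_numpy: take the max pressure per
        # angle, shift and sort the angles, and close the 0~720 range.
        pressure = numpy.amax(data_array[:, 1].reshape(720, -1), axis=1)
        data_output = numpy.zeros((721, 2))
        data_output[:-1, 0] = (data_array[:720, 0] + 720) % 720
        data_output[:-1, 1] = pressure
        data_output[-1, 0] = 721  # sentinel so this row sorts last
        data_output = data_output[data_output[:, 0].argsort()]
        data_output[-1] = data_output[0]
        data_output[-1, 0] = 720.0
        with open(out_name, 'w', encoding='utf-8') as f:
            numpy.savetxt(f, data_output, fmt='%f', delimiter=',')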
Why numpy.loadtxt is slow
Sorry, I can't help you review your numpy code in detail, but I googled why numpy.loadtxt is so slow:
Seriously, stop using the numpy.loadtxt() function (unless you have a lot of spare time...). Why you might ask? - Because it is SLOW! - How slow you might ask? - Very slow! Numpy loads a 250 mb csv-file containing 6215000 x 4 datapoints from my SSD in approx. 35 s!
There are other related links about this problem.
So, as mentioned in those links, pandas might be a better choice for reading the CSV file, or you can leave it to plain pure Python.
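For example, here is a minimal sketch of the loading step rewritten with pandas.read_csv (the column positions match the question's usecols=(0, 7); skiprows=[1] drops the units row while keeping the header, and everything after the load can stay the same):

    import pandas

    def load_with_pandas(filename):
        # pandas.read_csv uses a fast C parser, unlike numpy.loadtxt,
        # which parses every line in pure Python.
        frame = pandas.read_csv(filename, usecols=[0, 7], skiprows=[1],
                                encoding='utf-8-sig')
        return frame.to_numpy()[::10]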
Comment: Thank you very much! I've never thought of file I/O as the key! I used the profiler in IPython and got similar results: both main_pure_python and main_with_numpy spend most of their time getting the data, with numpy.loadtxt being even worse. – user2458587, Nov 6, 2018
Comment: loadtxt and savetxt don't make much use of compiled numpy; they use Python file I/O. Their performance and code have been discussed on SO many times.
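(In other words, even a hand-rolled reader in plain Python can beat loadtxt. A minimal sketch for this file layout, assuming every data row is numeric and comma-separated:)

    import numpy

    def load_manually(filename):
        # Split each line with plain Python and convert to an array once,
        # avoiding loadtxt's per-line packing overhead.
        with open(filename, encoding='utf-8-sig') as f:
            rows = [line.split(',') for line in list(f)[2:] if line.strip()]
        return numpy.array([(r[0], r[7]) for r in rows], dtype=float)[::10]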