I have a python script that I'd like to speed up. I'm hoping some of the gurus here can offer some advice.
The script reads in some data from file: 2 arrays with lat/lon vals, 3 arrays with image data and one quality mask. Each array is typically around 22M datapoints. An output array exists and I convert the lat/lon values into array subscripts, mask the image data and then put into the correct locations in the output array.
I've got a feeling that my current method, whilst successful, is far from optimized!
mainarr = np.zeros((4,1800,3600))
#Loop over all input files
for infile in files:
    #Code for reading data goes here. Limited by disk I/O so no optimisation here.
    #Location arrays are: lats,lons
    #Image data arrays are: c550,c670,c870
    #Quality mask is: cldy
    #Each of the above is a 1d numpy array, all are the same size
    #Function for reading the input data is not available, have called it 'readdata()' here.
    #lats, lons, c550, c670, c870 are all 32 bit floats. cldy is 8bit int.
    lats = np.array(readdata()).ravel()
    lons = np.array(readdata()).ravel()
    c550 = np.array(readdata()).ravel()
    c670 = np.array(readdata()).ravel()
    c870 = np.array(readdata()).ravel()
    cldy = np.array(readdata(), dtype=np.int8).ravel()

    lats = np.array(np.round((lats+90)*10), dtype=np.int16)
    lons = np.array(np.round((lons+180)*10), dtype=np.int16)
    lats[(lats>=1800).nonzero()] = 1799
    lons[(lons>=3600).nonzero()] = 3599

    #The function below is already optimized
    masker = getmask(cldy, 1, 1, 0)
    pts = (masker != 1).nonzero()
    lats[pts] = 0
    lons[pts] = 0

    mainarr[0,lats,lons] = mainarr[0,lats,lons]+c550
    mainarr[1,lats,lons] = mainarr[1,lats,lons]+c670
    mainarr[2,lats,lons] = mainarr[2,lats,lons]+c870
    mainarr[3,lats,lons] = mainarr[3,lats,lons]+1
    i = i+1
- I think you need to show the entire code including the file read, for possible shortcuts related to building lists and arrays. You could be having an issue with multiple file reads... But I need to see more code. – holroy, Dec 16, 2015
- @holroy Unfortunately I can't post the reading code. As I say, it cannot be optimised further. I have, however, added a few extra lines to the above code that show how the read data is arranged in the lat/lon/etc. arrays. Hopefully it's useful. As I mentioned in the text, the files contain approx. 22 million datapoints each. There are several hundred thousand input files. – os1, Dec 19, 2015
- The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code. See "How to get the best value out of Code Review: Asking Questions" for guidance on writing good question titles. – Toby Speight, Sep 2, 2019
1 Answer
You can get minor performance improvements if you use NumPy's += syntax, e.g.

mainarr[0, lats, lons] = mainarr[0, lats, lons] + c550

would reduce to

mainarr[0, lats, lons] += c550

Though it is not truly in-place, you still gain some speed.
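One caveat worth noting: with fancy indexing, both the original assignment and the += form apply each value only once per index, so two points that land in the same grid cell within one file overwrite each other rather than accumulate. If accumulating duplicates is what you actually want, NumPy's unbuffered np.add.at does that (usually at some speed cost). A minimal sketch, reusing the names from the code above:

# Sketch: unbuffered in-place addition that accumulates duplicate indices,
# unlike plain fancy-index assignment or +=.
np.add.at(mainarr[0], (lats, lons), c550)
np.add.at(mainarr[3], (lats, lons), 1)   # counts every hit, even in the same cell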
Another idea I had was to use numpy.clip, as it might be faster than setting the upper bound for your indices manually. You may be able to gain further improvements by using numpy.put with mode='clip' instead of actually clipping the values. Positive effects for numpy.take (which is the counterpart of numpy.put) are described here.
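Because numpy.put works on the flattened array, one way to apply the mode='clip' idea to the two-dimensional layers here is via numpy.ravel_multi_index, which clips out-of-range indices while it computes flat offsets. A minimal sketch of that idea, reusing the names and shapes from the code above (whether it actually beats np.clip would need to be measured):

# Sketch: let ravel_multi_index clip out-of-range indices while computing
# flat offsets into one 1800x3600 layer.
flat_idx = np.ravel_multi_index((lats, lons), (1800, 3600), mode='clip')
layer = mainarr[0].ravel()   # a view, since mainarr is C-contiguous
layer[flat_idx] += c550      # same duplicate-index semantics as before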
I put together a little program to test the gain compared to your proposed code. As you did not provide much additional information, I had to make some assumptions, e.g. about what your data looks like (see mock_data).

The results look something like the following:
Experiment (10 files, 10000 points)
Original: 10.093s Refactored: 9.964s
Experiment (10 files, 1000 points)
Original: 2.399s Refactored: 2.362s
Experiment (10 files, 100 points)
Original: 0.349s Refactored: 0.342s
Experiment (100 files, 10000 points)
Original: 47.823s Refactored: 47.672s
Experiment (100 files, 1000 points)
Original: 10.888s Refactored: 10.781s
Experiment (100 files, 100 points)
Original: 3.306s Refactored: 3.195s
Experiment (1000 files, 10000 points)
Original: 423.727s Refactored: 420.922s
Experiment (1000 files, 1000 points)
Original: 58.546s Refactored: 56.579s
Experiment (1000 files, 100 points)
Original: 20.227s Refactored: 18.260s
Conclusion
In my opinion, you will not be able to gain a great speed-up by fiddling around with NumPy here. I made a small test, not included in the listing above, with 1 file and 1M points:
Experiment (1 files, 1000000 points)
Original: 51.245s Refactored: 48.703s
Because of this I think NumPy is not really the problem in your case. If your mask hits a lot of points, it might be a good idea to remove the corresponding indices from lats and lons with numpy.delete, because both arrays get shorter and the (0, 0) bin of mainarr will not be polluted with all the masked values you currently dump there.
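A minimal sketch of that idea, using the variable names from your code (boolean indexing such as lats = lats[masker == 1] would work just as well as numpy.delete):

# Sketch: drop masked points instead of piling them into bin (0, 0)
bad = (masker != 1).nonzero()[0]
lats = np.delete(lats, bad)
lons = np.delete(lons, bad)
c550 = np.delete(c550, bad)
c670 = np.delete(c670, bad)
c870 = np.delete(c870, bad)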
As the operations on one file do not depend on the other files, one could also split the work across processes and merge the results later. You would need to re-implement your function so that it can run in parallel in different processes, each over part of your file list; a rough sketch follows below. But even this might not help you much if your I/O capacity is the bottleneck.
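A rough sketch of that idea with multiprocessing. Here process_files is a hypothetical re-implementation of your loop body that returns its own grid, and files is the same file list as in your code; this only pays off if reading the files is not already I/O-bound.

from multiprocessing import Pool
import numpy as np

def process_files(file_chunk):
    """Hypothetical re-implementation of the loop body for one chunk of files."""
    mainarr = np.zeros((4, 1800, 3600))
    for infile in file_chunk:
        ...  # read, convert, mask and accumulate as in the original loop
    return mainarr

if __name__ == '__main__':
    n_workers = 4
    # split the file list round-robin into one chunk per worker
    chunks = [files[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial_grids = pool.map(process_files, chunks)
    # the partial grids simply sum to the full result
    mainarr = np.sum(partial_grids, axis=0)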
Test script
import numpy as np
import timeit


def mock_data(n_samples=1000):
    """Mock data generation"""
    lats = np.random.uniform(-90.0, 90.0, (n_samples, ))
    lons = np.random.uniform(-180.0, 180.0, (n_samples, ))
    c550 = np.random.rand(n_samples)
    c670 = np.random.rand(n_samples)
    c870 = np.random.rand(n_samples)
    cldy = np.random.rand(n_samples)
    return lats, lons, c550, c670, c870, cldy


def mock_files(n_files=500, n_data=1000):
    """Mock files for benchmarking"""
    np.random.seed(42)
    return [mock_data(n_data) for _ in range(n_files)]


def getmask(cldy, tresh, max_val, min_val):
    """Mock getmask"""
    return cldy > (tresh / 2.0)


def original(files):
    """Original function by @Simon"""
    mainarr = np.zeros((4, 1800, 3600))
    i = 0
    # Loop over all input files
    for infile in files:
        lats, lons, c550, c670, c870, cldy = infile
        lats = np.array(np.round((lats+90)*10), dtype=np.int16)
        lons = np.array(np.round((lons+180)*10), dtype=np.int16)
        lats[(lats >= 1800).nonzero()] = 1799
        lons[(lons >= 3600).nonzero()] = 3599
        # The function below is already optimized
        masker = getmask(cldy, 1.0, 1.0, 0.0)
        pts = (masker != 1).nonzero()
        lats[pts] = 0
        lons[pts] = 0
        mainarr[0, lats, lons] = mainarr[0, lats, lons] + c550
        mainarr[1, lats, lons] = mainarr[1, lats, lons] + c670
        mainarr[2, lats, lons] = mainarr[2, lats, lons] + c870
        mainarr[3, lats, lons] = mainarr[3, lats, lons] + 1
        i = i + 1
    return mainarr


def refactored(files):
    """Refactored function by @Alex Vorndran"""
    mainarr = np.zeros((4, 1800, 3600))
    # Loop over all input files
    for i, infile in enumerate(files):
        lats, lons, c550, c670, c870, cldy = infile
        lats = np.array(np.round((lats+90)*10), dtype=np.int16)
        lons = np.array(np.round((lons+180)*10), dtype=np.int16)
        np.clip(lats, 0, 1799, out=lats)
        np.clip(lons, 0, 3599, out=lons)
        # The function below is already optimized
        masker = getmask(cldy, 1.0, 1.0, 0.0)
        pts = (masker != 1).nonzero()
        lats[pts] = 0
        lons[pts] = 0
        # make use of "in-place" additions
        mainarr[0, lats, lons] += c550
        mainarr[1, lats, lons] += c670
        mainarr[2, lats, lons] += c870
        mainarr[3, lats, lons] += 1.0
    return mainarr


def validate_refactoring(n_files, n_data):
    """Validate the refactoring with comparison by value"""
    files = mock_files(n_files, n_data)
    mainarr_o = original(files)
    mainarr_r = refactored(files)
    return np.allclose(mainarr_o, mainarr_r)


def test_functions():
    """Test different versions"""
    n_runs = 100
    # a list of tuples where the first element is the number of files
    # and the second one is the number of points per file
    files_and_points = [(  10, 10000), (  10, 1000), (  10, 100),
                        ( 100, 10000), ( 100, 1000), ( 100, 100),
                        (1000, 10000), (1000, 1000), (1000, 100)]
    assert validate_refactoring(10, 100)
    imports = 'from __main__ import original, refactored, mock_files;'
    for n_files, n_data in files_and_points:
        data = 'files = mock_files({}, {});'.format(n_files, n_data)
        time_original = timeit.timeit(
            'original(files)', setup=imports+data, number=n_runs)
        time_refactored = timeit.timeit(
            'refactored(files)', setup=imports+data, number=n_runs)
        print('Experiment ({} files, {} points)'.format(n_files, n_data))
        print('Original: {:.3f}s Refactored: {:.3f}s\n'.format(
            time_original, time_refactored))


if __name__ == '__main__':
    test_functions()
- Thanks for the detailed reply, Alex. That has helped a lot: I didn't know that Python, like C, accepted "+=" when working with arrays. I edited my code in line with your revised code and got quite an impressive speedup (see below, 'new'). I then also added your suggestion for using np.delete and that sped things up even more ('new2' below). – os1, Dec 19, 2015
- Test results (average time to run the code, running each version 5 times with 14 image files, each containing around 24M pixels): old code 80.6 seconds, new code 53.7 seconds, new2 code 41.85 seconds. So the new2 code runs in nearly half the time of the original. Excellent! – os1, Dec 19, 2015