I have a python script that I'd like to speed up. I'm hoping some of the gurus here can offer some advice.
The script reads in some data from file: 2 arrays with lat/lon vals, 3 arrays with image data and one quality mask. Each array is typically around 22M datapoints. An output array exists and I convert the lat/lon values into array subscripts, mask the image data and then put into the correct locations in the output array.
I've got a feeling that my current method, whilst successful, is far from optimized!
mainarr = np.zeros((4,1800,3600))
#Loop over all input files
for infile in files:
    #Code for reading data goes here. Limited by disk I/O so no optimisation here.
    #Location arrays are: lats,lons
    #Image data arrays are: c550,c670,c870
    #Quality mask is: cldy
    #Each of the above is a 1d numpy array, all are the same size
    #Function for reading the input data is not available, have called it 'readdata()' here.
    #lats, lons, c550, c670, c870 are all 32 bit floats. cldy is 8bit int.
    lats = np.array(readdata()).ravel()
    lons = np.array(readdata()).ravel()
    c550 = np.array(readdata()).ravel()
    c670 = np.array(readdata()).ravel()
    c870 = np.array(readdata()).ravel()
    cldy = np.array(readdata(), dtype=np.int8).ravel()

    lats = np.array(np.round((lats+90)*10), dtype=np.int16)
    lons = np.array(np.round((lons+180)*10), dtype=np.int16)
    lats[(lats>=1800).nonzero()] = 1799
    lons[(lons>=3600).nonzero()] = 3599

    #The function below is already optimized
    masker = getmask(cldy, 1, 1, 0)
    pts = (masker != 1).nonzero()
    lats[pts] = 0
    lons[pts] = 0

    mainarr[0,lats,lons] = mainarr[0,lats,lons]+c550
    mainarr[1,lats,lons] = mainarr[1,lats,lons]+c670
    mainarr[2,lats,lons] = mainarr[2,lats,lons]+c870
    mainarr[3,lats,lons] = mainarr[3,lats,lons]+1
    i = i+1
- I think you need to show the entire code including the file read, for possible shortcuts related to building lists and arrays. You could be having an issue with multiple file reads... But I need to see more code. – holroy, Dec 16, 2015
- @holroy Unfortunately I can't post the reading code. As I say, it cannot be optimised further. I have, however, added a few extra lines to the above code that show how the read data is arranged in the lat/lon/etc. arrays. Hopefully it's useful. As I mentioned in the text, the files contain approx. 22 million datapoints each. There are several hundred thousand input files. – os1, Dec 19, 2015
- The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code. See "How to get the best value out of Code Review: Asking Questions" for guidance on writing good question titles. – Toby Speight, Sep 2, 2019
1 Answer
You can get minor performance improvements if you use NumPy's += syntax, e.g.

mainarr[0, lats, lons] = mainarr[0, lats, lons] + c550

would reduce to

mainarr[0, lats, lons] += c550

Though it is not truly in-place, you still gain some speed.
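One caveat worth noting: with fancy indexing, both the original assignment and the += form apply each value only once per index, so two points that land in the same grid cell within one file overwrite each other rather than accumulate. If accumulating duplicates is what you actually want, NumPy's unbuffered np.add.at does that (usually at some speed cost). A minimal sketch, reusing the names from the code above:

# Sketch: unbuffered in-place addition that accumulates duplicate indices,
# unlike plain fancy-index assignment or +=.
np.add.at(mainarr[0], (lats, lons), c550)
np.add.at(mainarr[3], (lats, lons), 1)   # counts every hit, even in the same cell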
Another idea I had was to use numpy.clip, as it might be faster than setting the upper bound for your indices manually. You may be able to gain further improvements by using numpy.put with mode='clip' instead of actually clipping the values. Positive effects for numpy.take (which is the counterpart of numpy.put) are described here.
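Because numpy.put works on the flattened array, one way to apply the mode='clip' idea to the two-dimensional layers here is via numpy.ravel_multi_index, which clips out-of-range indices while it computes flat offsets. A minimal sketch of that idea, reusing the names and shapes from the code above (whether it actually beats np.clip would need to be measured):

# Sketch: let ravel_multi_index clip out-of-range indices while computing
# flat offsets into one 1800x3600 layer.
flat_idx = np.ravel_multi_index((lats, lons), (1800, 3600), mode='clip')
layer = mainarr[0].ravel()   # a view, since mainarr is C-contiguous
layer[flat_idx] += c550      # same duplicate-index semantics as before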
I put together a little program to test the gain compared to your proposed code. As you did not provide much additional information, I had to make some assumptions, e.g. about what your data looks like (see mock_data).

The results look something like the following:
Experiment (10 files, 10000 points)
Original: 10.093s Refactored: 9.964s
Experiment (10 files, 1000 points)
Original: 2.399s Refactored: 2.362s
Experiment (10 files, 100 points)
Original: 0.349s Refactored: 0.342s
Experiment (100 files, 10000 points)
Original: 47.823s Refactored: 47.672s
Experiment (100 files, 1000 points)
Original: 10.888s Refactored: 10.781s
Experiment (100 files, 100 points)
Original: 3.306s Refactored: 3.195s
Experiment (1000 files, 10000 points)
Original: 423.727s Refactored: 420.922s
Experiment (1000 files, 1000 points)
Original: 58.546s Refactored: 56.579s
Experiment (1000 files, 100 points)
Original: 20.227s Refactored: 18.260s
Conclusion
In my opinion, you will not be able to gain a great speed-up by fiddling around with NumPy here. I made a small test, not included in the listing above, with 1 file and 1M points:
Experiment (1 files, 1000000 points)
Original: 51.245s Refactored: 48.703s
Because of this I think NumPy is not really the problem in your case. If your mask hits a lot of points, it might be a good idea to remove the corresponding indices from lats and lons with numpy.delete, because both arrays get shorter and the (0, 0) bin of mainarr will not be polluted with all the masked values you currently dump there.
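A minimal sketch of that idea, using the variable names from your code (boolean indexing such as lats = lats[masker == 1] would work just as well as numpy.delete):

# Sketch: drop masked points instead of piling them into bin (0, 0)
bad = (masker != 1).nonzero()[0]
lats = np.delete(lats, bad)
lons = np.delete(lons, bad)
c550 = np.delete(c550, bad)
c670 = np.delete(c670, bad)
c870 = np.delete(c870, bad)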
As the operations on one file do not depend on the other files, one could also split the work across processes and merge the results later. You would need to re-implement your function so that it can run in parallel in different processes, each over part of your file list; a rough sketch follows below. But even this might not help you much if your I/O capacity is the bottleneck.
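A rough sketch of that idea with multiprocessing. Here process_files is a hypothetical re-implementation of your loop body that returns its own grid, and files is the same file list as in your code; this only pays off if reading the files is not already I/O-bound.

from multiprocessing import Pool
import numpy as np

def process_files(file_chunk):
    """Hypothetical re-implementation of the loop body for one chunk of files."""
    mainarr = np.zeros((4, 1800, 3600))
    for infile in file_chunk:
        ...  # read, convert, mask and accumulate as in the original loop
    return mainarr

if __name__ == '__main__':
    n_workers = 4
    # split the file list round-robin into one chunk per worker
    chunks = [files[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial_grids = pool.map(process_files, chunks)
    # the partial grids simply sum to the full result
    mainarr = np.sum(partial_grids, axis=0)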
Test script
import numpy as np
import timeit


def mock_data(n_samples=1000):
    """Mock data generation"""
    lats = np.random.uniform(-90.0, 90.0, (n_samples, ))
    lons = np.random.uniform(-180.0, 180.0, (n_samples, ))
    c550 = np.random.rand(n_samples)
    c670 = np.random.rand(n_samples)
    c870 = np.random.rand(n_samples)
    cldy = np.random.rand(n_samples)
    return lats, lons, c550, c670, c870, cldy


def mock_files(n_files=500, n_data=1000):
    """Mock files for benchmarking"""
    np.random.seed(42)
    return [mock_data(n_data) for _ in range(n_files)]


def getmask(cldy, tresh, max_val, min_val):
    """Mock getmask"""
    return cldy > (tresh / 2.0)


def original(files):
    """Original function by @Simon"""
    mainarr = np.zeros((4, 1800, 3600))
    i = 0
    # Loop over all input files
    for infile in files:
        lats, lons, c550, c670, c870, cldy = infile
        lats = np.array(np.round((lats+90)*10), dtype=np.int16)
        lons = np.array(np.round((lons+180)*10), dtype=np.int16)
        lats[(lats >= 1800).nonzero()] = 1799
        lons[(lons >= 3600).nonzero()] = 3599
        # The function below is already optimized
        masker = getmask(cldy, 1.0, 1.0, 0.0)
        pts = (masker != 1).nonzero()
        lats[pts] = 0
        lons[pts] = 0
        mainarr[0, lats, lons] = mainarr[0, lats, lons] + c550
        mainarr[1, lats, lons] = mainarr[1, lats, lons] + c670
        mainarr[2, lats, lons] = mainarr[2, lats, lons] + c870
        mainarr[3, lats, lons] = mainarr[3, lats, lons] + 1
        i = i + 1
    return mainarr


def refactored(files):
    """Refactored function by @Alex Vorndran"""
    mainarr = np.zeros((4, 1800, 3600))
    # Loop over all input files
    for i, infile in enumerate(files):
        lats, lons, c550, c670, c870, cldy = infile
        lats = np.array(np.round((lats+90)*10), dtype=np.int16)
        lons = np.array(np.round((lons+180)*10), dtype=np.int16)
        np.clip(lats, 0, 1799, out=lats)
        np.clip(lons, 0, 3599, out=lons)
        # The function below is already optimized
        masker = getmask(cldy, 1.0, 1.0, 0.0)
        pts = (masker != 1).nonzero()
        lats[pts] = 0
        lons[pts] = 0
        # make use of "in-place" additions
        mainarr[0, lats, lons] += c550
        mainarr[1, lats, lons] += c670
        mainarr[2, lats, lons] += c870
        mainarr[3, lats, lons] += 1.0
    return mainarr


def validate_refactoring(n_files, n_data):
    """Validate the refactoring with comparison by value"""
    files = mock_files(n_files, n_data)
    mainarr_o = original(files)
    mainarr_r = refactored(files)
    return np.allclose(mainarr_o, mainarr_r)


def test_functions():
    """Test different versions"""
    n_runs = 100
    # a list of tuples where the first element is the number of files
    # and the second one is the number of points per file
    files_and_points = [(  10, 10000), (  10, 1000), (  10, 100),
                        ( 100, 10000), ( 100, 1000), ( 100, 100),
                        (1000, 10000), (1000, 1000), (1000, 100)]
    assert validate_refactoring(10, 100)
    imports = 'from __main__ import original, refactored, mock_files;'
    for n_files, n_data in files_and_points:
        data = 'files = mock_files({}, {});'.format(n_files, n_data)
        time_original = timeit.timeit(
            'original(files)', setup=imports+data, number=n_runs)
        time_refactored = timeit.timeit(
            'refactored(files)', setup=imports+data, number=n_runs)
        print('Experiment ({} files, {} points)'.format(n_files, n_data))
        print('Original: {:.3f}s Refactored: {:.3f}s\n'.format(
            time_original, time_refactored))


if __name__ == '__main__':
    test_functions()
- Thanks for the detailed reply, Alex. That has helped a lot: I didn't know that Python, like C, accepted "+=" when working with arrays. I edited my code in line with your revised code and got quite an impressive speedup (see below, 'new'). I then also added your suggestion for using np.delete and that sped things up even more ('new2' below). – os1, Dec 19, 2015
- Test results (average time to run the code, running each version 5 times with 14 image files, each containing around 24M pixels): old code 80.6 seconds, new code 53.7 seconds, new2 code 41.85 seconds. So the new2 code runs in nearly half the time of the original. Excellent! – os1, Dec 19, 2015