I have a dataset in the matrix y and I want to train different SOMs with it. The SOM is one-dimensional (a line), and its number of neurons varies. I train a SOM of size N=2 first and one of size N=NMax last, giving a total of NMax-2+1 SOMs. For each SOM, I want to store the weights once training is over, before moving on to the next SOM.

The whole point of using PyOpenCL here is that each iteration of the outer loop is independent of the others: for a given value of N, the script doesn't care about what happens for other values of N. One could get the same result by running the script NMax-2+1 times, changing the value of N manually each time.

With this in mind, I was hoping to perform each of these independent iterations at the same time on the GPU, so that the total time drops significantly. The time won't drop all the way to 1/(NMax-2+1) of the original, though, because each iteration is more expensive than the previous one: larger values of N require more calculations.

Is there a way to 'translate' this code to run on the GPU? I've never used OpenCL before, so let me know if this is too broad or silly, so I can ask a more specific question. The code is self-contained, so feel free to try it out. The four constants declared at the beginning can be changed to whatever you like (as long as NMax > 1 and all the others are strictly positive).

import numpy as np
import time

m = 3              # Dimension of datapoints
num_points = 2000  # Number of datapoints
iterMax = 150      # Maximum number of iterations
NMax = 3           # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points, m)  # Always generate the same dataset
sigma_0 = 5  # Initial width of the neighborhood function
eta_0 = 1    # Initial learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y, axis=0), NMax - 1))  # Clusters for each N

t_begin = time.perf_counter()  # Start time (time.clock() was removed in Python 3.8)
for N in range(NMax - 1):      # The SOM in this iteration has N + 2 neurons
    w[N] = np.random.uniform(0, 1, (N + 2, np.size(y, axis=1))) - 0.5  # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y, axis=0)), :]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0 / (iterMax + 1)) * iterCount
        s2 = 2 * sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0 / (iterMax + 1)) * iterCount
        for selectedInput in mixInputs:  # Pick one pattern
            # Search for the winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis=-1)
            ii = np.argmin(aux)  # Neuron 'ii' is the winner
            jjs = abs(ii - np.arange(N + 2))
            dists = np.min(np.vstack([jjs, abs(jjs - (N + 2))]), axis=0)
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2) / s2)[:, np.newaxis] * (selectedInput - w[N])
        print(N + 2, iterCount)
        iterCount += 1
    # Assign each datapoint to its nearest neuron
    for kk in range(np.size(y, axis=0)):
        aux = np.sum((y[kk, :] - w[N])**2, axis=-1)
        ii = np.argmin(aux)  # Neuron 'ii' is the winner
        wClusters[kk, N] = ii + 1
t_end = time.perf_counter()  # End time
#%%
print(t_end - t_begin)
asked Aug 4, 2018 at 21:16
1 Answer

I'm trying to give a somewhat complete answer.

First of all:

Can this code be adapted to be run on the GPU using (py)OpenCL?

Most probably yes.

Can this been done automatically?

No (afaik).

Most of the questions I get about OpenCL are along the lines of: "Is it worth porting this piece of code to OpenCL for a speed gain?" You state that your outer loop is independent of the results of the other runs, which makes the code basically parallelizable. In a straightforward implementation, each OpenCL work item would execute the same code with slightly different input parameters. Ignoring the overhead of data transfer between host and device, the running time of this approach would equal the running time of the slowest iteration. Depending on the iterations in your outer loop, this could be a massive speed gain. As long as the numbers stay relatively small, you could try the multiprocessing module in Python to parallelize these iterations on the CPU instead of the GPU.

Porting to the GPU usually only makes sense if a huge number of processes are to be run in parallel (about 1000 or more). So in your case, if you really want an enormous speed boost, see if you can parallelize all calculations inside the loop. For example, you have 150 iterations and 2000 data points. If you could somehow parallelize over these 2000 data points, that could offer a much bigger speed gain, which could justify the work of porting the whole code to OpenCL. A rough sketch of what that could look like follows.
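
To make this concrete, here is a rough, untested PyOpenCL sketch (not a full port of the SOM) showing how the nearest-neuron search could be mapped to one work item per data point. The kernel name nearest_neuron, the buffer layout and the sizes are only illustrative choices, and the training loop itself is not shown.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# One work item per data point: each finds the index of its nearest neuron.
kernel_src = """
__kernel void nearest_neuron(__global const float *points,   // num_points x m
                             __global const float *weights,  // n_neurons x m
                             __global int *winners,
                             const int m,
                             const int n_neurons)
{
    int gid = get_global_id(0);
    float best = FLT_MAX;
    int best_idx = 0;
    for (int n = 0; n < n_neurons; n++) {
        float d = 0.0f;
        for (int k = 0; k < m; k++) {
            float diff = points[gid * m + k] - weights[n * m + k];
            d += diff * diff;
        }
        if (d < best) { best = d; best_idx = n; }
    }
    winners[gid] = best_idx;
}
"""
prg = cl.Program(ctx, kernel_src).build()

num_points, m, n_neurons = 2000, 3, 10
points = np.random.rand(num_points, m).astype(np.float32)
weights = np.random.rand(n_neurons, m).astype(np.float32)
winners = np.empty(num_points, dtype=np.int32)

mf = cl.mem_flags
points_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=points)
weights_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=weights)
winners_buf = cl.Buffer(ctx, mf.WRITE_ONLY, winners.nbytes)

prg.nearest_neuron(queue, (num_points,), None, points_buf, weights_buf,
                   winners_buf, np.int32(m), np.int32(n_neurons))
cl.enqueue_copy(queue, winners, winners_buf)  # winners[i] = nearest neuron of point i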

TL;DR: Try parallelizing on the CPU first. If you find yourself needing to run more than several hundred processes at the same time, move to the GPU.

Update: Simple code for parallelizing on the CPU using multiprocessing (without a callback)

import numpy as np
import time
import multiprocessing as mp

m = 3              # Dimension of datapoints
num_points = 2000  # Number of datapoints
iterMax = 150      # Maximum number of iterations
NMax = 10          # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points, m)  # Always generate the same dataset
sigma_0 = 5  # Initial width of the neighborhood function
eta_0 = 1    # Initial learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y, axis=0), NMax - 1))  # Clusters for each N

def neuron_run(N):
    """Train the SOM with N + 2 neurons (one independent outer-loop iteration)."""
    w[N] = np.random.uniform(0, 1, (N + 2, np.size(y, axis=1))) - 0.5  # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y, axis=0)), :]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0 / (iterMax + 1)) * iterCount
        s2 = 2 * sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0 / (iterMax + 1)) * iterCount
        for selectedInput in mixInputs:  # Pick one pattern
            # Search for the winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis=-1)
            ii = np.argmin(aux)  # Neuron 'ii' is the winner
            jjs = abs(ii - np.arange(N + 2))
            dists = np.min(np.vstack([jjs, abs(jjs - (N + 2))]), axis=0)
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2) / s2)[:, np.newaxis] * (selectedInput - w[N])
        print(N + 2, iterCount)
        iterCount += 1
    # Assign each datapoint to its nearest neuron
    # (without a callback, these results stay in the worker process and are
    #  not copied back into the parent's w and wClusters)
    for kk in range(np.size(y, axis=0)):
        aux = np.sum((y[kk, :] - w[N])**2, axis=-1)
        ii = np.argmin(aux)  # Neuron 'ii' is the winner
        wClusters[kk, N] = ii + 1

t_begin = time.perf_counter()  # Start time (time.clock() was removed in Python 3.8)
#%%
def apply_async():
    pool = mp.Pool(processes=NMax)
    for N in range(NMax - 1):
        pool.apply_async(neuron_run, args=(N,))
    pool.close()
    pool.join()
    print("Multiprocessing done!")

if __name__ == '__main__':
    apply_async()
t_end = time.perf_counter()  # End time
print(t_end - t_begin)
answered Aug 10, 2018 at 13:46

11 Comments

Parallelizing on the CPU using multiprocessing was even slower than the non-parallelized version. Does this make sense?
In principle, this should not be the case. There are some cases where scheduling and overhead (of launching threads/tasks/processes) as well as memory management are the bottleneck and the parallel program is slower. Can you confirm (in your task manager) that the parallel program is running on all cores? If only one core is used, that could be the problem. If you want more information, you can also shoot me a mail to tackle your specific problem, which I feel is too complex to be answered within the scope of this question.
Update: Your code runs on my system in ~38s for 4 neurons and parallelized in ~13s (parallel code edited into my answer).
On 16 cores, the posted code runs in ~30s for 32 neurons.
One more remark: most of the time in your code is spent in the block that searches for the winning neuron (the for loop over mixInputs). You should start by speeding this block up (probably by parallelizing it); see the sketch below.
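
For illustration, a hedged NumPy sketch of what batching that distance search over all data points at once could look like (the name w_N and the shapes are made up; the sequential online weight update would still have to stay inside the loop):

import numpy as np

y = np.random.rand(2000, 3)        # (num_points, m) data points
w_N = np.random.rand(10, 3) - 0.5  # (n_neurons, m) weights of one SOM

# Squared distance of every data point to every neuron: shape (num_points, n_neurons)
d2 = np.sum((y[:, np.newaxis, :] - w_N[np.newaxis, :, :])**2, axis=-1)
winners = np.argmin(d2, axis=1)    # index of the nearest neuron for each data point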