Reduction, or
FoldList , computes the last value of the sequence given and a starting value
x:
This loads the reduction code into
Mathematica. For performance reasons, the block dimensions are specified. Since the input is a power of 2, you set the define to
True :
This specifies the input and allocates memory for the output:
You can call again with to further reduce, although since the output is of length 128, there is no performance benefit of doing this over just calling
Mathematica's
Total function:
The result of adding a 65536 constant vector of 1 agrees with
Mathematica's
Total function:
The source code can be modified to implement operations other than sum. In this case, you count the number of occurrences of the number 5 in the list:
OpenCLFunction is loaded the same as before, except that you use a block size of 512:
Use random integers as input:
This invokes the function. Again, you use Mathematica to total the 1024 remaining numbers:
The result agrees with
Count :
This times the launch:
Note the 7× performance increase:
If the input is already on the GPU, then a performance increase of 52× is observed:
The input can be images; here you write code that performs linear interpolation between images (this can be done using
ImageCompose ):
This sets the , , and values. It also allocates memory for the :
This calls the function with threads:
This gets the memory and displays it as an image:
You can take the above and make a function :
A
Manipulate can be used to play with the interpolation coefficients:
Effects can be made; in this example, a smooth animation is viewed:
Uniform random number generators are common seed problems in many applications. This implements uniform random number generators in OpenCL:
This loads the source as an
OpenCLFunction . This algorithm uses an image to provide an upper bound to the random number:
This calls
OpenCLFunction ; note that you can pass images directly into an
OpenCLFunction so long as it can be interpreted using the appropriate specified type:
Notice that this is not a regular Lena image; it is a 4-channel image with alpha channel set to 1 (using
SetAlphaChannel ):
The random output can be used to detect important edges in an image:
The Mersenne Twister is another uniform random number generator algorithm (more sophisticated than the one mentioned above). The implementation is located here:
This loads
OpenCLFunction ; you specify the type , which means that the type is dependent on the CPU capabilities (whether it supports double precision or not):
This sets up the Mersenne Twister's input and output parameters (for more information, refer to the algorithm description):
This plots the output's results:
If the output is timed:
There is almost an 11× increase in speed:
The scan, or prefix sum, algorithm is similar to
FoldList and is a very useful primitive algorithm that can be used in a variety of scenarios. The OpenCL implementation is found in:
This loads the three kernels used in computation:
This generates random input data:
This allocates the output buffer:
This computes the block and grid dimensions:
A temporary buffer is needed in computation:
This performs the scan operation:
This retrieves the output buffer:
The Sobel filter (only horizontal will be discussed in this example) is a convolution with the kernel ; this is called the kernel. Using local memory in this case will increase the performance—see the Sobel filter in the Canny edge-detection example for a faster implementation. A trivial implementation in OpenCL is located here:
This specifies the , , and values of the image:
One property of the Sobel filter is that it is separable. This means that the kernel above can be represented by first convolving with and then ; these are called the and , respectively. The following loads the functions from the same file:
If a convolution kernel is separable, then composing the separable filters is more efficient than performing the full convolution, since it reduces the amount of needed memory copies.
Matrix transpose is a fundamental algorithm in many applications. This specifies the inputs:
The result agrees with Mathematica:
Matrix multiplication is implemented here:
This defines the block size:
This loads
OpenCLFunction ; note it is specified that the input must be rank 2:
This creates random input and allocates the output:
The output agrees with Mathematica:
The one-dimensional discrete fast Fourier transform can be implemented using OpenCLLink; this implementation assumes that the input is a power of 2:
This creates input and output lists:
This calls the output memory and creates a complex list, displaying only the first 50 elements:
Black-Scholes models financial derivative investments and is implemented in OpenCL:
This assigns the input parameters:
This gets the call values:
For timing, the number of options to be valuated is increased:
On the C2050, it takes 1/100 of a second to valuate 2048 options:
On a Core i7 950,
FinancialDerivative takes 1.13 seconds. This is a speedup of 280×. Note that increasing the number of options will exhibit more speedups:
Recursive Gaussian is used to approximate the Gaussian filter. The algorithm relies on the fact that the Gaussian matrix is separable:
It can be written as the outer product of a 1D Gaussian:
The recursive Gaussian is implemented here:
This specifies the value. Recall that the Gaussian is :
With set to 5.0, the normal distribution is calculated:
Mathematica can plot the distribution:
Here you calculate the recursive Gaussian parameters:
The input, temporary, and output memory is loaded as
OpenCLMemory :
Here you call the functions. First you perform the Gaussian horizontally, then transpose, then perform the Gaussian vertically, then transpose to get the full Gaussian:
This gets the output image:
Again you compare timing:
And notice a 10× performance boost:
Bitonic sort sorts a given set of integers. It is similar in principal to merge sort. The OpenCL implementation only works on lists of length of a power of 2 and can be found here:
This sets the length of the input and loads it. The direction denotes whether to sort from highest to lowest or lowest to highest. In this case, you sort from lowest to highest:
This gets the input list:
This calls bitonic sort, similar to merge sort; multiple calls are needed for a full sort:
The output list is retrieved sorted: