\$\begingroup\$

I am developing a CUDA program and I want to improve its performance. I have a kernel function that consumes more than 70% of the execution time. The kernel calculates the distance between pairs of spatial points and, based on whether they are neighbors or not, fills a boolean vector.

Any ideas on how to get more speedup?

Here is the code:

#include <cuda.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <iostream>
#include <cstdlib> // std::rand, std::srand

#define _SQR(a) ((a)*(a))
#define _BLOCKSIZE 32

__host__ void RandGen(double* A, int n){
    double a = 1.0;
    for (int i = 0; i < n; i++) {
        A[i] = (double)std::rand()/(double)(RAND_MAX)*a;
    }
}

// kernel for parallel distance check
__global__ void DistanceChecker(double* xPos, double* yPos, double* zPos, double* h,
                                int* particles1, int* particles2,
                                int NumberOfP1, int NumberOfP2, bool* distance)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y * blockDim.y + threadIdx.y;
    double DISTANCE;
    if (idx < NumberOfP1 && idy < NumberOfP2) {
        DISTANCE = _SQR(xPos[particles1[idx]] - xPos[particles2[idy]]) +
                   _SQR(yPos[particles1[idx]] - yPos[particles2[idy]]) +
                   _SQR(zPos[particles1[idx]] - zPos[particles2[idy]]);
        distance[idy + NumberOfP2 * idx] = (DISTANCE < _SQR(h[particles1[idx]] + h[particles2[idy]]));
    }
}

int main(int argc, char* argv[]){
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int num = 1024; // number of particles
    thrust::host_vector<double> h_xPos(num), h_yPos(num), h_zPos(num), h_h(num, 0.001);
    std::srand(11);
    RandGen(&h_xPos[0], num);
    std::srand(15);
    RandGen(&h_yPos[0], num);
    std::srand(19);
    RandGen(&h_zPos[0], num);
    thrust::device_vector<double> d_xPos(h_xPos), d_yPos(h_yPos), d_zPos(h_zPos), d_h(h_h);
    float dummymili;
    float distanceCheck = 0.f;
    int nBranches = 1024;
    for (int i = 0; i < nBranches; i++) {
        thrust::device_vector<int> particles1(500);
        thrust::device_vector<int> particles2(500);
        thrust::device_vector<bool> distance(particles1.size()*particles2.size(), true);
        dim3 blockSize(32, 32); // also tested with blockSize(16,16)
        dim3 gridSize;
        gridSize.x = (particles1.size() + blockSize.x - 1) / blockSize.x;
        gridSize.y = (particles2.size() + blockSize.y - 1) / blockSize.y;
        cudaEventRecord(start);
        DistanceChecker<<<gridSize, blockSize>>>(
            thrust::raw_pointer_cast(&d_xPos[0]),
            thrust::raw_pointer_cast(&d_yPos[0]),
            thrust::raw_pointer_cast(&d_zPos[0]),
            thrust::raw_pointer_cast(&d_h[0]),
            thrust::raw_pointer_cast(&particles1[0]),
            thrust::raw_pointer_cast(&particles2[0]),
            particles1.size(), particles2.size(),
            thrust::raw_pointer_cast(&distance[0]));
        cudaDeviceSynchronize();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&dummymili, start, stop);
        distanceCheck += dummymili;
    }
    std::cout << "KERNEL TIME = " << distanceCheck << " milliseconds" << std::endl;
    return 0;
}

In the original code I sort the data before using them in the kernel, which I believe has a positive effect on subsequent memory accesses. So please assume the data have already been sorted to optimize memory access.

All the calculations must be done in double precision to reduce round-off error.

The GPU I am using is an NVIDIA Quadro K2000, and my CUDA version is 7.5.

asked Jan 6, 2016 at 18:22
\$\endgroup\$
  • \$\begingroup\$ I don't see you checking for cuda errors anywhere - any particular reason? \$\endgroup\$ Commented Jan 6, 2016 at 20:26
  • \$\begingroup\$ @Dannnno: The code works and there is no bug. I just need comments to improve the performance of the DistanceChecker kernel. \$\endgroup\$ Commented Jan 6, 2016 at 23:11
  • \$\begingroup\$ I'm not saying there's a bug - I'm just saying that checking for cuda errors is generally a good thing to do. \$\endgroup\$ Commented Jan 7, 2016 at 0:13
  • \$\begingroup\$ Why don't you use std::hypot? \$\endgroup\$ Commented Sep 1, 2021 at 5:07

3 Answers

\$\begingroup\$

[It would be helpful to know the total runtime of the code on your GPU as well as the kernel time. Treat this as a comment, as I cannot comment yet...]

Two suggestions as to why your runtime is so long:

Hardware

Since you want to do the calculations in double precision, you should look for hardware that provides many more double-precision units. Your GPU (Quadro K2000) has only 384/24*2 = 32 of them (cf. anandtech.com). This results in a peak performance of only about 15 GFLOP/s (~0.95 GHz).

Workload

Another problem is the small problem size. You are launching kernels with a grid of only 16x16 blocks, i.e. ~250k threads. Additionally, every thread performs only about 10 double-precision operations (14 if the macro argument `a` is evaluated twice in `(a)*(a)`), which results in a total of 2.5 MFLOP (or 3.5 MFLOP). Even on your GPU, the kernel runtime at peak performance would only be about 0.17 ms (or 0.23 ms). GPUs only reach maximal performance as the problem size grows.
On my GPU, a kernel that does no calculations at all still takes about 20% of the runtime of the full kernel (without optimization flags). (Inaccurate time measurement may be a problem, too.)


You may want to test your code on another GPU, or use single precision and check whether it runs noticeably faster.

There are ways to do calculations at DP-like precision without DP units, but I don't know whether that is worthwhile for GPU computing.

answered Jun 6, 2017 at 13:49
\$\endgroup\$
\$\begingroup\$
#define _SQR(a) ((a)*(a))
#define _BLOCKSIZE 32

Identifiers beginning with an underscore followed by an uppercase letter are reserved to the implementation for any purpose. That means these definitions may break your standard library headers, for example.

Remove the underscore and these will be fine.

It's also a good idea to avoid all-caps names for ordinary variables (DISTANCE), as they look like macros to most C and C++ programmers.

answered Aug 31, 2021 at 7:57
\$\endgroup\$
\$\begingroup\$

I tried your code and did not use your

#define _SQR(a) ((a)*(a))

by replacing

DISTANCE = _SQR(xPos[particles1[idx]] - xPos[particles2[idy]]) +
           _SQR(yPos[particles1[idx]] - yPos[particles2[idy]]) +
           _SQR(zPos[particles1[idx]] - zPos[particles2[idy]]);

with this

double x, y, z;
x = (xPos1[idx] - xPos2[i]);
y = (yPos1[idx] - yPos2[i]);
z = (zPos1[idx] - zPos2[i]);
 
DISTANCE = x*x + y*y + z*z;

the time came down from 2600 ms to 1800 ms for 22.5 billion points.

answered Aug 30, 2021 at 18:24
\$\endgroup\$
  • \$\begingroup\$ Note that this would probably work with _SQR as well. The key thing is to do the subtractions prior to the _SQR. \$\endgroup\$ Commented Aug 30, 2021 at 19:25
