\$\begingroup\$

I am developing a CUDA program and I want to improve its performance. I have a kernel function that consumes more than 70% of the execution time. The kernel calculates the distance between pairs of spatial points and, based on whether they are neighbors or not, fills a boolean vector.

Any ideas on how to get more speedup?

Here is the code:

#include <cuda.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <iostream>
#include <cstdlib> // std::rand, std::srand

#define _SQR(a) ((a)*(a))
#define _BLOCKSIZE 32

__host__ void RandGen(double* A, int n){
    double a = 1.0;
    for (int i = 0; i < n; i++) {
        A[i] = (double)std::rand()/(double)(RAND_MAX)*a;
    }
}

// kernel for parallel distance check
__global__ void DistanceChecker(double* xPos, double* yPos, double* zPos, double* h,
                                int* particles1, int* particles2,
                                int NumberOfP1, int NumberOfP2, bool* distance)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y * blockDim.y + threadIdx.y;
    double DISTANCE;
    if (idx < NumberOfP1 && idy < NumberOfP2) {
        DISTANCE = _SQR(xPos[particles1[idx]] - xPos[particles2[idy]]) +
                   _SQR(yPos[particles1[idx]] - yPos[particles2[idy]]) +
                   _SQR(zPos[particles1[idx]] - zPos[particles2[idy]]);
        distance[idy + NumberOfP2 * idx] = (DISTANCE < _SQR(h[particles1[idx]] + h[particles2[idy]]));
    }
}

int main(int argc, char* argv[]){
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int num = 1024; // number of particles
    thrust::host_vector<double> h_xPos(num), h_yPos(num), h_zPos(num), h_h(num, 0.001);
    std::srand(11);
    RandGen(&h_xPos[0], num);
    std::srand(15);
    RandGen(&h_yPos[0], num);
    std::srand(19);
    RandGen(&h_zPos[0], num);
    thrust::device_vector<double> d_xPos(h_xPos), d_yPos(h_yPos), d_zPos(h_zPos), d_h(h_h);
    float dummymili;
    float distanceCheck = 0.f;
    int nBranches = 1024;
    for (int i = 0; i < nBranches; i++) {
        thrust::device_vector<int> particles1(500);
        thrust::device_vector<int> particles2(500);
        thrust::device_vector<bool> distance(particles1.size()*particles2.size(), true);
        dim3 blockSize(32, 32); // also tested with blockSize(16,16)
        dim3 gridSize;
        gridSize.x = (particles1.size() + blockSize.x - 1) / blockSize.x;
        gridSize.y = (particles2.size() + blockSize.y - 1) / blockSize.y;
        cudaEventRecord(start);
        DistanceChecker<<<gridSize, blockSize>>>(
            thrust::raw_pointer_cast(&d_xPos[0]),
            thrust::raw_pointer_cast(&d_yPos[0]),
            thrust::raw_pointer_cast(&d_zPos[0]),
            thrust::raw_pointer_cast(&d_h[0]),
            thrust::raw_pointer_cast(&particles1[0]),
            thrust::raw_pointer_cast(&particles2[0]),
            particles1.size(), particles2.size(),
            thrust::raw_pointer_cast(&distance[0]));
        cudaDeviceSynchronize();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&dummymili, start, stop);
        distanceCheck += dummymili;
    }
    std::cout << "KERNEL TIME = " << distanceCheck << " milliseconds" << std::endl;
    return 0;
}

In the original code I sort the data before using them in the kernel, which I believe has a positive effect on subsequent memory accesses. So please assume the data have already been sorted to optimize memory access.

All the calculations must be done in double precision to reduce round-off error.

The GPU I am using is an NVIDIA Quadro K2000, and my CUDA version is 7.5.

asked Jan 6, 2016 at 18:22
\$\endgroup\$
  • \$\begingroup\$ I don't see you checking for cuda errors anywhere - any particular reason? \$\endgroup\$ Commented Jan 6, 2016 at 20:26
  • \$\begingroup\$ @Dannnno: The code works and there is no bug. I just need comments to improve the performance of the DistanceChecker kernel. \$\endgroup\$ Commented Jan 6, 2016 at 23:11
  • \$\begingroup\$ I'm not saying there's a bug - I'm just saying that checking for cuda errors is generally a good thing to do. \$\endgroup\$ Commented Jan 7, 2016 at 0:13
  • \$\begingroup\$ Why don't you use std::hypot? \$\endgroup\$ Commented Sep 1, 2021 at 5:07

3 Answers

\$\begingroup\$

[It would be helpful to know the total runtime of the code on your GPU as well as the kernel time. Treat this as a comment, as I cannot comment yet...]

Two suggestions as to why your runtime is so long:

Hardware

Since you want to do the calculations in double precision, you should look for hardware that provides many more double-precision units. Your GPU (Quadro K2000) has only 384/24*2 = 32 of them (cf. anandtech.com). This results in a peak performance of only about 15 GFLOP/s (~0.95 GHz).

Workload

Another problem is the small problem size. You are launching kernels with a grid of only 16x16 blocks, i.e. ~250k threads. Additionally, every thread performs only about 10 double-precision operations (14 if the macro argument `a` is evaluated twice in `(a)*(a)`), which results in a total of 2.5 MFLOP (or 3.5 MFLOP). Even on your GPU, the kernel runtime at peak performance would only be about 0.17 ms (or 0.23 ms). GPUs only reach maximal performance as the problem size grows.
On my GPU, a kernel that does no calculations at all still takes about 20% of the runtime of the full kernel (without optimization flags). (Inaccurate time measurement may be a problem, too.)


You may want to test your code on another GPU, or use single precision and check whether it runs noticeably faster.

There are ways to do calculations at DP-like precision without DP units, but I don't know whether that is worthwhile for GPU computing.

answered Jun 6, 2017 at 13:49
\$\endgroup\$
\$\begingroup\$
#define _SQR(a) ((a)*(a))
#define _BLOCKSIZE 32

Identifiers beginning with an underscore followed by an uppercase letter are reserved to the implementation for any purpose. That means these definitions may break your standard library headers, for example.

Remove the underscore and these will be fine.

It's also a good idea to avoid all-caps names for ordinary variables (DISTANCE), as they look like macros to most C and C++ programmers.

answered Aug 31, 2021 at 7:57
\$\endgroup\$
\$\begingroup\$

I tried your code and did not use your

#define _SQR(a) ((a)*(a))

by replacing

DISTANCE = _SQR(xPos[particles1[idx]] - xPos[particles2[idy]]) +
           _SQR(yPos[particles1[idx]] - yPos[particles2[idy]]) +
           _SQR(zPos[particles1[idx]] - zPos[particles2[idy]]);

with this

double x, y, z;
x = (xPos1[idx] - xPos2[i]);
y = (yPos1[idx] - yPos2[i]);
z = (zPos1[idx] - zPos2[i]);
 
DISTANCE = x*x + y*y + z*z;

the time came down from 2600 ms to 1800 ms for 22.5 billion points.

answered Aug 30, 2021 at 18:24
\$\endgroup\$
  • \$\begingroup\$ Note that this would probably work with _SQR as well. The key thing is to do the subtractions prior to the _SQR. \$\endgroup\$ Commented Aug 30, 2021 at 19:25
