I am trying to perform the basic operation + with CUDA for GPU computation. The function vectorIncreaseOne contains the details of the operation, and the gpuIncreaseOne function applies that operation to each element of the parameter data_for_calculation.
The experimental implementation
The experimental implementation of the gpuIncreaseOne function is as below.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <helper_cuda.h>
#include <math.h>

__global__ void CUDACalculation::vectorIncreaseOne(const long double* input, long double* output, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        if (input[i] < 255)
        {
            output[i] = input[i] + 1;
        }
    }
}

int CUDACalculation::gpuIncreaseOne(float* data_for_calculation, int size)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = size;
    size_t DataSize = numElements * sizeof(float);

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    err = cudaMemcpy(d_A, data_for_calculation, DataSize, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorIncreaseOne<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);
    err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    err = cudaMemcpy(data_for_calculation, d_C, DataSize, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free device global memory
    err = cudaFree(d_A);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    return 0;
}
Test cases
The test case for the gpuIncreaseOne function is as below.
auto data_pointer = (float*)malloc(100 * sizeof(float));

for (int i = 0; i < 100; i++)
{
    data_pointer[i] = static_cast<float>(1);
}

CUDACalculation::gpuIncreaseOne(data_pointer, 100);
free(data_pointer);
All suggestions are welcome. If there are any possible improvements regarding:
- Potential drawbacks or unnecessary overhead
- Error handling
please let me know.
1 Answer
In your kernel, vectorIncreaseOne uses long double* types. According to the NVIDIA forums, there is no support for long double. Since your launch data parameters are float* types, use float instead.
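A minimal sketch of the kernel rewritten with float, following the advice above. The host fallback under the #else branch is my addition, not part of the original code: it applies the same per-element rule in a loop so the logic can be exercised without a GPU.

```cpp
#include <cstdio>

// Sketch: the kernel with float instead of the unsupported long double.
#ifdef __CUDACC__
__global__ void vectorIncreaseOne(const float* input, float* output, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements && input[i] < 255.0f)
    {
        output[i] = input[i] + 1.0f;
    }
}
#else
// Host fallback for testing: one loop iteration per would-be CUDA thread.
void vectorIncreaseOne(const float* input, float* output, int numElements)
{
    for (int i = 0; i < numElements; ++i)
    {
        if (input[i] < 255.0f)
        {
            output[i] = input[i] + 1.0f;
        }
    }
}
#endif
```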
Change both size identifiers to include units (i.e., bytes vs. elements):
gpuIncreaseOne(float* data_for_calculation, const int numElements) // renamed size
{
    // int numElements = size; // delete this line
    size_t DataSizeBytes = numElements * sizeof(float); // added units (Bytes)
In general, variable names should be easily recognizable a week or a year from now. Since the kernel has global (off-chip) and local (on-chip shared) memory, the index name should indicate which one is meant. Some names recommended in the NVIDIA training course are gindex for global and lindex for local shared memory. Or, since you seem to prefer shorter names, gIdx and lIdx will suffice.
int gindex = blockDim.x * blockIdx.x + threadIdx.x; // renamed i to gindex
NIT: In general, a single-character name such as i for a for-loop counter is IMO better renamed to something more unique, such as ii, or better, a name that conveys the meaning of the counter. The reason is that when your code is posted on a review board and the reviewer wants to see all references to the counter, too many hits occur if the name i is used.
When a bare number is used within a statement, we call it a magic number. Reviewers often cannot instantly tell the meaning and the units associated with these numbers. I recommend that you start using constants with meaningful names and units.
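For example, the 256 and 255 in the posted code could become named constants, and the grid-size arithmetic could live in one small helper. The names here are only illustrative:

```cpp
// Illustrative named constants replacing the magic numbers 256 and 255.
constexpr int   kThreadsPerBlock = 256;    // CUDA threads per block
constexpr float kMaxValue        = 255.0f; // saturation threshold for the increment

// Ceiling division: how many blocks are needed to cover numElements threads.
int blocksFor(int numElements)
{
    return (numElements + kThreadsPerBlock - 1) / kThreadsPerBlock;
}
```

With this, the launch configuration reads blocksFor(numElements) instead of repeating the arithmetic at each call site.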
All the necessary CUDA error handling takes up a lot of lines and clutters up the program. Here's a cleaner approach.
// CUDA error checking macro
#define CHECK(call)                                                           \
do {                                                                          \
    const cudaError_t error = (call);                                         \
    if (error != cudaSuccess)                                                 \
    {                                                                         \
        printf("Error: %s:%d, ", __FILE__, __LINE__);                         \
        printf("code:%d, reason: %s\n", error, cudaGetErrorString(error));    \
        printf("*** FAILED - ABORTING\n");                                    \
        exit(EXIT_FAILURE);                                                   \
    }                                                                         \
} while (0)
Now you can write:
CHECK(cudaMemcpy(h_out, d_out, size_N_Bytes, cudaMemcpyDeviceToHost));
For the CHECK macro, the do-while idiom is used: "macros in Linux and other codebases wrap their logic in do/while(0) because it ensures the macro always behaves the same, regardless of how semicolons and curly-brackets are used in the invoking code."
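A small host-side illustration (not CUDA-specific) of why the idiom matters. The do/while(0) wrapper plus the caller's trailing semicolon parse as a single statement, so the macro is safe in an unbraced if/else; a macro that expanded to a bare { ... } block would make the else below a syntax error. CHECK_COUNT is a hypothetical stand-in that counts invocations instead of aborting:

```cpp
// Hypothetical stand-in for the reviewer's CHECK macro: counts invocations
// instead of aborting, so the control flow can be observed.
static int g_checks = 0;
#define CHECK_COUNT(call) do { (void)(call); ++g_checks; } while (0)

// Using the macro unbraced inside if/else: the trailing semicolon completes
// a single do/while statement, so the 'else' still pairs with the 'if'.
int demo(int x)
{
    if (x == 1)
        CHECK_COUNT(x + 1);
    else
        x = -1;
    return x;
}
```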
Variable d_B does not appear to have any purpose, so remove it.
- PaulH (Jun 9 at 16:03): @TobySpeight, Removed the std::fill() part. Thx.
- PaulH (Jun 20 at 4:54): @TobySpeight, Looking at the OP again, the "" indicates C++, not C. I will change the tag from C to C++.
- PaulH (Jul 29 at 13:11): Changed the tag from C to C++ due to the usage of static_cast<float>.