I am trying to perform the basic operation + with CUDA for GPU computation. The function vectorIncreaseOne contains the details of the operation, and the gpuIncreaseOne function applies that operation to each element of the parameter data_for_calculation.
The experimental implementation
The experimental implementation of the gpuIncreaseOne function is as below.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <helper_cuda.h>
#include <math.h>

__global__ void CUDACalculation::vectorIncreaseOne(const long double* input, long double* output, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        if (input[i] < 255)
        {
            output[i] = input[i] + 1;
        }
    }
}

int CUDACalculation::gpuIncreaseOne(float* data_for_calculation, int size)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = size;
    size_t DataSize = numElements * sizeof(float);

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, DataSize);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    err = cudaMemcpy(d_A, data_for_calculation, DataSize, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorIncreaseOne<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);
    err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    err = cudaMemcpy(data_for_calculation, d_C, DataSize, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free device global memory
    err = cudaFree(d_A);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    return 0;
}
Test cases
The test case for the gpuIncreaseOne function is as below.
auto data_pointer = (float*)malloc(100 * sizeof(float));

for (int i = 0; i < 100; i++)
{
    data_pointer[i] = static_cast<float>(1);
}

CUDACalculation::gpuIncreaseOne(data_pointer, 100);
free(data_pointer);
All suggestions are welcome. If there are any possible improvements regarding:
- Potential drawbacks or unnecessary overhead
- Error handling
please let me know.
1 Answer
In your kernel, vectorIncreaseOne uses long double* types. According to the NVIDIA forums, there is no support for long double. Since your launch data parameters are float* types, use float instead.
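A minimal sketch of the kernel rewritten with float, following the advice above. The host fallback under the #else branch is my addition, not part of the original code: it applies the same per-element rule in a loop so the logic can be exercised without a GPU.

```cpp
#include <cstdio>

// Sketch: the kernel with float instead of the unsupported long double.
#ifdef __CUDACC__
__global__ void vectorIncreaseOne(const float* input, float* output, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements && input[i] < 255.0f)
    {
        output[i] = input[i] + 1.0f;
    }
}
#else
// Host fallback for testing: one loop iteration per would-be CUDA thread.
void vectorIncreaseOne(const float* input, float* output, int numElements)
{
    for (int i = 0; i < numElements; ++i)
    {
        if (input[i] < 255.0f)
        {
            output[i] = input[i] + 1.0f;
        }
    }
}
#endif
```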
Change both size identifiers to include units (i.e., bytes vs. elements):
gpuIncreaseOne(float* data_for_calculation, const int numElements) // renamed size
{
    // int numElements = size; // delete this line
    size_t DataSizeBytes = numElements * sizeof(float); // added units (Bytes)
In general, variable names should be easily recognizable a week or a year from now. Since the kernel has global (off-chip) and local (on-chip shared) memory, the index name should indicate which one is meant. Some names recommended in the NVIDIA training course are gindex for global and lindex for local shared memory. Or, since you seem to prefer shorter names, gIdx and lIdx will suffice.
int gindex = blockDim.x * blockIdx.x + threadIdx.x; // renamed i to gindex
NIT: In general, a single-character name such as i for a for-loop counter is IMO better renamed to something more unique, such as ii, or better, a name that conveys the meaning of the counter. The reason is that when your code is posted on a review board and the reviewer wants to see all references to the counter, too many hits occur if the name i is used.
When a bare number is used within a statement, we call it a magic number. Reviewers often cannot instantly tell the meaning and the units associated with these numbers. I recommend that you start using constants with meaningful names and units.
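For example, the 256 and 255 in the posted code could become named constants, and the grid-size arithmetic could live in one small helper. The names here are only illustrative:

```cpp
// Illustrative named constants replacing the magic numbers 256 and 255.
constexpr int   kThreadsPerBlock = 256;    // CUDA threads per block
constexpr float kMaxValue        = 255.0f; // saturation threshold for the increment

// Ceiling division: how many blocks are needed to cover numElements threads.
int blocksFor(int numElements)
{
    return (numElements + kThreadsPerBlock - 1) / kThreadsPerBlock;
}
```

With this, the launch configuration reads blocksFor(numElements) instead of repeating the arithmetic at each call site.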
All the necessary CUDA error handling takes up a lot of lines and clutters up the program. Here's a cleaner approach.
// CUDA error checking macro
#define CHECK(call)                                                           \
do {                                                                          \
    const cudaError_t error = (call);                                         \
    if (error != cudaSuccess)                                                 \
    {                                                                         \
        printf("Error: %s:%d, ", __FILE__, __LINE__);                         \
        printf("code:%d, reason: %s\n", error, cudaGetErrorString(error));    \
        printf("*** FAILED - ABORTING\n");                                    \
        exit(EXIT_FAILURE);                                                   \
    }                                                                         \
} while (0)
Now you can write:
CHECK(cudaMemcpy(h_out, d_out, size_N_Bytes, cudaMemcpyDeviceToHost));
For the CHECK macro, the do-while idiom is used: "macros in Linux and other codebases wrap their logic in do/while(0) because it ensures the macro always behaves the same, regardless of how semicolons and curly-brackets are used in the invoking code."
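A small host-side illustration (not CUDA-specific) of why the idiom matters. The do/while(0) wrapper plus the caller's trailing semicolon parse as a single statement, so the macro is safe in an unbraced if/else; a macro that expanded to a bare { ... } block would make the else below a syntax error. CHECK_COUNT is a hypothetical stand-in that counts invocations instead of aborting:

```cpp
// Hypothetical stand-in for the reviewer's CHECK macro: counts invocations
// instead of aborting, so the control flow can be observed.
static int g_checks = 0;
#define CHECK_COUNT(call) do { (void)(call); ++g_checks; } while (0)

// Using the macro unbraced inside if/else: the trailing semicolon completes
// a single do/while statement, so the 'else' still pairs with the 'if'.
int demo(int x)
{
    if (x == 1)
        CHECK_COUNT(x + 1);
    else
        x = -1;
    return x;
}
```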
Variable d_B does not appear to have any purpose, so remove it.
- PaulH (Jun 9 at 16:03): @TobySpeight, Removed the std::fill() part. Thx.
- PaulH (Jun 20 at 4:54): @TobySpeight, Looking at the OP again, the "" indicates C++, not C. I will change the tag from C to C++.
- PaulH (Jul 29 at 13:11): Changed the tag from C to C++ due to the usage of static_cast<float>.