|
| 1 | +# Chapter 5: Debugging and Profiling CUDA Programs |
| 2 | + |
| 3 | +In this chapter, we will explore various tools and techniques to debug and profile CUDA programs. Debugging helps identify and fix errors in your code, while profiling helps optimize performance by analyzing how your code executes on the GPU. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | + |
| 7 | +1. [Introduction](#introduction) |
| 8 | +2. [Debugging CUDA Programs](#debugging-cuda-programs) |
| 9 | + - [Using cuda-gdb](#using-cuda-gdb) |
| 10 | + - [Using Nsight Eclipse Edition](#using-nsight-eclipse-edition) |
| 11 | + - [Using Nsight Visual Studio Edition](#using-nsight-visual-studio-edition) |
| 12 | +3. [Profiling CUDA Programs](#profiling-cuda-programs) |
| 13 | + - [Using Nsight Systems](#using-nsight-systems) |
| 14 | + - [Using Nsight Compute](#using-nsight-compute) |
| 15 | +4. [Sample Code](#sample-code) |
| 16 | +5. [Additional Resources](#additional-resources) |
| 17 | + |
| 18 | +## Introduction |
| 19 | + |
| 20 | +Debugging and profiling are essential steps in CUDA development. They help ensure your code runs correctly and efficiently on the GPU. This guide will introduce you to the tools and techniques needed to debug and profile CUDA programs effectively. |
| 21 | + |
| 22 | +## Debugging CUDA Programs |
| 23 | + |
| 24 | +### Using cuda-gdb |
| 25 | + |
| 26 | +`cuda-gdb` is a powerful debugger for CUDA applications. It allows you to set breakpoints, step through code, and inspect variables. |
| 27 | + |
| 28 | +1. **Compile with Debug Information**: |
| 29 | + ```sh |
| 30 | + nvcc -g -G -o debapro debapro.cu |
| 31 | + ``` |
| 32 | + |
| 33 | +2. **Start cuda-gdb**: |
| 34 | + ```sh |
| 35 | + cuda-gdb ./debapro |
| 36 | + ``` |
| 37 | + |
| 38 | +3. **Set Breakpoints**: |
| 39 | + ```sh |
| 40 | + (cuda-gdb) break main |
| 41 | + ``` |
| 42 | + |
| 43 | +4. **Run the Program**: |
| 44 | + ```sh |
| 45 | + (cuda-gdb) run |
| 46 | + ``` |
| 47 | + |
| 48 | +5. **Step Through Code**: |
| 49 | + ```sh |
| 50 | + (cuda-gdb) next |
| 51 | + ``` |
| 52 | + |
| 53 | +6. **Inspect Variables**: |
| 54 | + ```sh |
| 55 | + (cuda-gdb) print variable_name |
| 56 | + ``` |
| 57 | + |
| 58 | +### Using Nsight Eclipse Edition |
| 59 | + |
| 60 | +Nsight Eclipse Edition provides an integrated development environment for debugging CUDA applications. |
| 61 | + |
| 62 | +1. **Open Nsight Eclipse Edition**. |
| 63 | +2. **Create a Debug Configuration**: |
| 64 | + - Go to **Run > Debug Configurations**. |
| 65 | + - Create a new CUDA C/C++ Application configuration. |
| 66 | + - Set the project and application to your `debapro` executable. |
| 67 | +3. **Set Breakpoints and Start Debugging**. |
| 68 | + |
| 69 | +### Using Nsight Visual Studio Edition |
| 70 | + |
| 71 | +Nsight Visual Studio Edition integrates with Visual Studio to provide debugging capabilities for CUDA applications. |
| 72 | + |
| 73 | +1. **Open Visual Studio**. |
| 74 | +2. **Create a Debug Configuration**: |
| 75 | + - Go to **Debug > Attach to Process**. |
| 76 | + - Select the CUDA application you want to debug. |
| 77 | +3. **Set Breakpoints and Start Debugging**. |
| 78 | + |
| 79 | +## Profiling CUDA Programs |
| 80 | + |
| 81 | +### Using Nsight Systems |
| 82 | + |
| 83 | +Nsight Systems provides system-wide performance analysis, helping you identify bottlenecks in your application. |
| 84 | + |
| 85 | +1. **Run Nsight Systems**: |
| 86 | + ```sh |
| 87 | + nsys profile ./debapro |
| 88 | + ``` |
| 89 | + |
| 90 | +2. **Analyze the Report**: |
| 91 | + - This command generates a report file (e.g., `report.qdrep`). |
| 92 | + - Open this file in the Nsight Systems GUI for detailed analysis. |
| 93 | + |
| 94 | +### Using Nsight Compute |
| 95 | + |
| 96 | +Nsight Compute provides detailed analysis of CUDA kernel performance. |
| 97 | + |
| 98 | +1. **Run Nsight Compute**: |
| 99 | + ```sh |
| 100 | + ncu ./debapro |
| 101 | + ``` |
| 102 | + |
| 103 | +2. **Analyze the Output**: |
| 104 | + - Nsight Compute will provide detailed metrics about the kernel execution. |
| 105 | + |
| 106 | +## Sample Code |
| 107 | + |
| 108 | +Here is a sample CUDA program for matrix multiplication: |
| 109 | + |
| 110 | +```cpp |
| 111 | +#include <iostream> |
| 112 | +#include <cuda_runtime.h> |
| 113 | + |
| 114 | +#define N 16 // Size of the matrix (N x N) |
| 115 | + |
| 116 | +__global__ void matrixMul(const float *A, const float *B, float *C, int width) { |
| 117 | + int row = blockIdx.y * blockDim.y + threadIdx.y; |
| 118 | + int col = blockIdx.x * blockDim.x + threadIdx.x; |
| 119 | + float value = 0; |
| 120 | + |
| 121 | + if (row < width && col < width) { |
| 122 | + for (int k = 0; k < width; ++k) { |
| 123 | + value += A[row * width + k] * B[k * width + col]; |
| 124 | + } |
| 125 | + C[row * width + col] = value; |
| 126 | + } |
| 127 | +} |
| 128 | + |
| 129 | +int main() { |
| 130 | + int size = N * N * sizeof(float); |
| 131 | + float h_A[N * N], h_B[N * N], h_C[N * N]; |
| 132 | + |
| 133 | + // Initialize matrices |
| 134 | + for (int i = 0; i < N * N; ++i) { |
| 135 | + h_A[i] = static_cast<float>(rand()) / RAND_MAX; |
| 136 | + h_B[i] = static_cast<float>(rand()) / RAND_MAX; |
| 137 | + } |
| 138 | + |
| 139 | + float *d_A, *d_B, *d_C; |
| 140 | + cudaMalloc((void **)&d_A, size); |
| 141 | + cudaMalloc((void **)&d_B, size); |
| 142 | + cudaMalloc((void **)&d_C, size); |
| 143 | +
|
| 144 | + cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); |
| 145 | + cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice); |
| 146 | +
|
| 147 | + dim3 threadsPerBlock(16, 16); |
| 148 | + dim3 blocksPerGrid((N + threadsPerBlock.x - 1) / threadsPerBlock.x, (N + threadsPerBlock.y - 1) / threadsPerBlock.y); |
| 149 | + matrixMul<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N); |
| 150 | +
|
| 151 | + cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); |
| 152 | +
|
| 153 | + // Verify the result |
| 154 | + for (int i = 0; i < N; ++i) { |
| 155 | + for (int j = 0; j < N; ++j) { |
| 156 | + float sum = 0; |
| 157 | + for (int k = 0; k < N; ++k) { |
| 158 | + sum += h_A[i * N + k] * h_B[k * N + j]; |
| 159 | + } |
| 160 | + if (fabs(sum - h_C[i * N + j]) > 1e-5) { |
| 161 | + std::cerr << "Result verification failed at element (" << i << ", " << j << ")" << std::endl; |
| 162 | + exit(EXIT_FAILURE); |
| 163 | + } |
| 164 | + } |
| 165 | + } |
| 166 | +
|
| 167 | + std::cout << "Test PASSED" << std::endl; |
| 168 | +
|
| 169 | + cudaFree(d_A); |
| 170 | + cudaFree(d_B); |
| 171 | + cudaFree(d_C); |
| 172 | +
|
| 173 | + return 0; |
| 174 | +} |
| 175 | +``` |
| 176 | +
|
| 177 | +## Resources to check |
| 178 | +
|
| 179 | +- [NVIDIA Developer Blog on Profiling and Debugging CUDA Applications](https://developer.nvidia.com/blog/new-video-tutuorial-profiling-and-debugging-nvidia-cuda-applications/)2 |
| 180 | +- [CUDA Profiler User’s Guide](https://docs.nvidia.com/cuda/profiler-users-guide/index.html)4 |
| 181 | +- [NVIDIA Nsight Developer Tools](https://developer.nvidia.com/nsight-visual-studio-edition)5 |
0 commit comments