# Chapter 2: Simple CUDA Vector Addition

## Overview

In this chapter, we'll create a basic CUDA program that performs vector addition on the GPU. The program consists of a CUDA kernel that adds corresponding elements from two input vectors and stores the result in an output vector.

### Code Explanation

Let's go through the code step by step:

1. **Kernel Function (`vectorAdd`)**:
   - The `vectorAdd` function is the heart of our CUDA program. It runs on the GPU and performs the vector addition.
   - It takes four arguments:
     - `A`: Pointer to the first input vector.
     - `B`: Pointer to the second input vector.
     - `C`: Pointer to the output vector (where the result will be stored).
     - `size`: Number of elements in each vector.
   - Inside the kernel, each thread computes its global index, adds the corresponding elements of `A` and `B`, and stores the sum in `C`.
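   The pattern described above can be sketched as a minimal kernel (the exact code in `vector_addition.cu` may differ slightly):

   ```cuda
   __global__ void vectorAdd(const float* A, const float* B, float* C, int size) {
       // Global index of this thread across the whole grid.
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       // Guard: skip threads past the end when size is not a multiple of the block size.
       if (i < size) {
           C[i] = A[i] + B[i];
       }
   }
   ```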
2. **Main Function**:
   - The `main` function sets up the host (CPU) and device (GPU) memory, initializes the input vectors, launches the kernel, and retrieves the result.
   - Key steps in the `main` function:
     - Allocate memory for the vectors (`h_A`, `h_B`, and `h_C`) on the host.
     - Initialize the input vectors (`h_A` and `h_B`) with sample values.
     - Allocate memory for the vectors (`d_A`, `d_B`, and `d_C`) on the device (GPU).
     - Copy data from host to device using `cudaMemcpy`.
     - Launch the `vectorAdd` kernel with appropriate grid and block dimensions.
     - Copy the result back from the device to the host.
     - Print the result (output vector `h_C`).

3. **Memory Allocation and Transfer**:
   - We allocate memory for the vectors on both the host and the device.
   - `cudaMalloc` allocates memory on the device.
   - `cudaMemcpy` transfers data between host and device; the direction is given by flags such as `cudaMemcpyHostToDevice` and `cudaMemcpyDeviceToHost`.
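   As a sketch, assuming `float` vectors of `size` elements:

   ```cuda
   size_t bytes = size * sizeof(float);

   // Allocate device buffers.
   float *d_A, *d_B, *d_C;
   cudaMalloc(&d_A, bytes);
   cudaMalloc(&d_B, bytes);
   cudaMalloc(&d_C, bytes);

   // Copy the inputs from host to device.
   cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
   cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
   ```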
4. **Kernel Launch**:
   - We launch the `vectorAdd` kernel using the `<<<numBlocks, threadsPerBlock>>>` syntax.
   - `numBlocks` and `threadsPerBlock` determine the grid and block dimensions; together they must cover all `size` elements.
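   A common way to choose these values (256 threads per block is a typical choice, not a requirement):

   ```cuda
   int threadsPerBlock = 256;
   // Round up so every element is covered even when size % threadsPerBlock != 0.
   int numBlocks = (size + threadsPerBlock - 1) / threadsPerBlock;
   vectorAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, size);
   ```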
5. **Clean Up**:
   - We free the allocated device memory using `cudaFree`.
   - We also release the host vectors (`h_A`, `h_B`, and `h_C`) to avoid memory leaks.
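   For example, assuming the host arrays were allocated with `new[]`:

   ```cuda
   // Release device memory.
   cudaFree(d_A);
   cudaFree(d_B);
   cudaFree(d_C);

   // Release host memory.
   delete[] h_A;
   delete[] h_B;
   delete[] h_C;
   ```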
## How to Compile and Run

1. **Compile the Code**:
   - Open your terminal or command prompt.
   - Navigate to the folder containing `vector_addition.cu`.
   - Compile the code using `nvcc` (the NVIDIA CUDA Compiler):
     ```
     nvcc vector_addition.cu -o vector_addition
     ```
2. **Run the Executable**:
   - Execute the compiled binary:
     ```
     ./vector_addition
     ```
   - You'll see the result of the vector addition printed to the console.