Does cudaMalloc allocate contiguous chunks of memory (i.e., physical bytes next to each other)?
I have a piece of CUDA code that simply copies 128 bytes from global device memory to shared memory, using 32 threads. I am trying to find a way to guarantee that this transfer completes in a single 128-byte memory transaction. If cudaMalloc allocates contiguous memory blocks, then this can be done easily.
Following is the code:
#include <iostream>
using namespace std;

#define SIZE 32        // number of uints to stage in shared memory
#define NUMTHREADS 32

__global__ void copy(uint* memPointer){
    extern __shared__ uint bits[];
    int tid = threadIdx.x;
    bits[tid] = memPointer[tid];   // one 4-byte load per thread, 128 bytes per warp
}
int main(){
    uint inputData[SIZE];
    uint* storedData;
    for(int i = 0; i < SIZE; i++){
        inputData[i] = i;
    }
    cudaError_t e1 = cudaMalloc((void**) &storedData, sizeof(uint)*SIZE);
    if(e1 == cudaSuccess){
        cudaError_t e3 = cudaMemcpy(storedData, inputData, sizeof(uint)*SIZE,
                                    cudaMemcpyHostToDevice);
        if(e3 == cudaSuccess){
            // third launch parameter is the dynamic shared memory size in bytes
            copy<<<1, NUMTHREADS, sizeof(uint)*SIZE>>>(storedData);
            cudaError_t e4 = cudaDeviceSynchronize();  // surface launch/runtime errors
            if(e4 != cudaSuccess){
                cout << "Kernel error" << " " << e4 << endl;
            }
            cudaError_t e6 = cudaFree(storedData);
            if(e6 != cudaSuccess){
                cout << "Error freeing memory storedData" << " " << e6 << endl;
            }
        }
        else{
            cout << "Failed to copy" << " " << e3 << endl;
        }
    }
    else{
        cout << "Failed to allocate memory" << " " << e1 << endl;
    }
    return 0;
}
- What purpose is that kernel supposed to serve? – talonmies, Jul 2, 2012 at 16:25
- It's part of a larger program in which I perform some operations on the data; I am trying to optimize individual parts of the code. – gmemon, Jul 2, 2012 at 16:35
- If the 128-byte block is 128-byte aligned, then this will be done in one transaction. NVIDIA GPUs have an MMU separate from the CPU's MMU, and all GPU memory operations go through the GPU's virtual address space. There is no guarantee that blocks larger than a cache line will be physically contiguous. – Greg Smith, Jul 3, 2012 at 1:13
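A quick host-side check of the alignment Greg Smith mentions (a minimal sketch; cudaMalloc allocations are documented to be aligned to at least 256 bytes, so 128-byte alignment should always hold):

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

int main(){
    unsigned int* p = nullptr;
    if (cudaMalloc((void**)&p, 128 * sizeof(unsigned int)) == cudaSuccess){
        uintptr_t addr = (uintptr_t)p;
        // If addr % 128 == 0, a warp reading 128 consecutive bytes from p
        // can be serviced by a single 128-byte transaction.
        printf("address %p, 128-byte aligned: %s\n",
               (void*)p, (addr % 128 == 0) ? "yes" : "no");
        cudaFree(p);
    }
    return 0;
}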
1 Answer
Yes, cudaMalloc allocates contiguous chunks of memory: the allocation is contiguous in the GPU's virtual address space, which is what matters for coalescing. (As Greg Smith notes in the comments, physical contiguity beyond a cache line is not guaranteed, but it is not required either.) The "Matrix Transpose" example in the SDK (http://developer.nvidia.com/cuda-cc-sdk-code-samples) has a kernel called "copySharedMem" that does almost exactly what you're describing.
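For reference, here is a minimal sketch in the spirit of that kernel; the tile size and names are illustrative, not the SDK's exact code. It stages data through shared memory with fully coalesced loads and stores:

__global__ void copySharedMem(float* odata, const float* idata, int n){
    __shared__ float tile[256];                 // illustrative tile size
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        tile[threadIdx.x] = idata[i];           // coalesced global load
    }
    __syncthreads();
    if (i < n){
        odata[i] = tile[threadIdx.x];           // coalesced global store
    }
}

// example launch, one thread per element:
// copySharedMem<<<(n + 255) / 256, 256>>>(d_out, d_in, n);

Because consecutive threads in a warp access consecutive 4-byte words from a 128-byte-aligned base, each warp's load is serviced in a single 128-byte transaction, which is exactly the behavior you are trying to guarantee.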