The number of coalesced and uncoalesced memory transactions on GPUs with compute capability 1.3

Question 1

The CUDA profiler manual states that, due to the more relaxed coalescing policy, the number of uncoalesced memory transactions will always be zero. But I'm sure there are still uncoalesced accesses. How can I calculate how many? Are there any tools or simulators that can help? Among them, which one seems to be the most accurate? Thanks.

Question 2

What makes you certain there are uncoalesced memory transactions? Wouldn't it be easier to compare memory throughput of the kernel to a known benchmark like CUDA memcpy?
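A rough sketch of that comparison, assuming you have timed both your kernel and a device-to-device memcpy of the same size. The function name and the timings below are hypothetical placeholders, not measurements:

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Effective bandwidth in GB/s: total bytes moved divided by elapsed time."""
    return (bytes_read + bytes_written) / seconds / 1e9

n_bytes = (1 << 24) * 4          # 16M floats, read once and written once

# Hypothetical timings in seconds; replace with your own measurements.
kernel_seconds = 2.0e-3
memcpy_seconds = 1.6e-3

kernel_bw = effective_bandwidth_gbs(n_bytes, n_bytes, kernel_seconds)
memcpy_bw = effective_bandwidth_gbs(n_bytes, n_bytes, memcpy_seconds)
ratio = kernel_bw / memcpy_bw    # close to 1.0 suggests well-coalesced access

print(f"kernel {kernel_bw:.1f} GB/s, memcpy {memcpy_bw:.1f} GB/s, ratio {ratio:.2f}")
```

A ratio well below 1 hints that the kernel issues more (or wider) transactions than it needs, without telling you exactly how many.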

Question 3

On devices of compute capability 1.0, you had only two options:

The memory access is coalesced and all data is fetched in one memory transaction.
The memory access is uncoalesced and data is fetched one element at a time; hence always 16 memory transactions (one per thread of a half-warp).
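To make the two-outcome model concrete, here is a small sketch. It assumes 4-byte words and the usual compute 1.0 conditions (thread k reads the k-th word of a segment aligned to 16 words); the function name and constants are mine, for illustration only:

```python
WORD_SIZE = 4    # bytes per thread access (assumed 32-bit words)
HALF_WARP = 16   # threads whose accesses are serviced together

def transactions_cc10(addresses):
    """Two-outcome model: 1 transaction if coalesced, otherwise 16."""
    assert len(addresses) == HALF_WARP
    base = addresses[0]
    # Assumed conditions: the base is aligned to the 64-byte segment and
    # thread k reads exactly the k-th word after the base.
    aligned = base % (HALF_WARP * WORD_SIZE) == 0
    in_sequence = all(a == base + k * WORD_SIZE for k, a in enumerate(addresses))
    return 1 if aligned and in_sequence else HALF_WARP

coalesced = [1024 + 4 * k for k in range(16)]
shifted = [1028 + 4 * k for k in range(16)]   # same pattern, off by one word

print(transactions_cc10(coalesced))  # 1
print(transactions_cc10(shifted))    # 16
```

Note how a shift by a single word already drops you to the worst case in this model.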

On devices of compute capability 1.2 and 1.3, however, this is done differently. Imagine your device memory divided into chunks of 128 bytes each. You need as many memory transactions as the number of chunks the half-warp hits. So:

if you get perfectly coalesced access, you get 1 memory transaction
if you are merely misaligned, you may get 2 memory transactions
if every thread accesses every n-th word, you can get 3, 4, or even more memory transactions
in the worst case you can get 16 memory transactions
but even if access is somewhat random, yet localised, two threads may happen to fall into the same chunk and you will need fewer than 16 memory transactions
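The cases above can be reproduced with a few lines that simply count the distinct 128-byte chunks touched by one half-warp. This is only the simplified model, with 4-byte words assumed and helper names made up for the example:

```python
SEGMENT = 128  # chunk size in bytes used by the simplified model

def count_segments(addresses, segment=SEGMENT):
    """Number of distinct aligned chunks touched by one half-warp."""
    return len({a // segment for a in addresses})

def half_warp(base, stride_words, word=4, threads=16):
    """Byte addresses where thread k reads the word at base + k * stride_words."""
    return [base + k * stride_words * word for k in range(threads)]

# 16 scattered word indices that all fall within one 512-byte window.
localised = [4 * w for w in (3, 90, 17, 64, 120, 5, 33, 77,
                             101, 12, 48, 59, 110, 26, 95, 70)]

print(count_segments(half_warp(0, 1)))    # 1  perfectly coalesced
print(count_segments(half_warp(96, 1)))   # 2  misaligned, straddles two chunks
print(count_segments(half_warp(0, 6)))    # 3  every 6th word
print(count_segments(half_warp(0, 8)))    # 4  every 8th word
print(count_segments(half_warp(0, 32)))   # 16 worst case, one chunk per thread
print(count_segments(localised))          # 4  random but localised
```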

There are so many cases that sorting them into just two categories, coalesced and uncoalesced, no longer makes sense. That is why the CUDA profiler takes a different approach: it simply counts the number of memory transactions. The more random your access pattern is, the higher the memory transaction count, even if the number of memory access instructions stays the same.

The above is a slightly simplified model. In reality, a memory transaction can access a 128-byte, 64-byte, or 32-byte wide chunk, to save bandwidth. Look for the columns load 128b, load 64b, load 32b, and store 128b, store 64b, store 32b in your profiler.
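A sketch of the refined model, assuming the hardware follows the procedure I remember from the CUDA Programming Guide for compute capability 1.2/1.3: start from the segment of the lowest-numbered pending thread, serve every thread that falls into it, then halve the transaction while only one half is used. The function name is hypothetical and word-aligned addresses are assumed:

```python
def transactions_cc12(addresses, word=4):
    """Return a list of (start, size) transactions for one half-warp."""
    # Assumed initial segment size: 32 B for 1-byte words, 64 B for 2-byte
    # words, 128 B for anything wider.
    initial = {1: 32, 2: 64}.get(word, 128)
    pending = list(addresses)      # thread order is preserved
    result = []
    while pending:
        size = initial
        start = pending[0] // size * size
        served = {a for a in pending if start <= a < start + size}
        lo, hi = min(served), max(served) + word - 1
        # Shrink the transaction while all touched bytes fit in one half.
        while size > 32:
            half = size // 2
            if hi < start + half:
                size = half
            elif lo >= start + half:
                start, size = start + half, half
            else:
                break
        result.append((start, size))
        pending = [a for a in pending if a not in served]
    return result

print(transactions_cc12([4 * k for k in range(16)]))       # [(0, 64)]
print(transactions_cc12([96 + 4 * k for k in range(16)]))  # [(96, 32), (128, 32)]
print(transactions_cc12([32 * k for k in range(16)]))      # four 128-byte transactions
```

Under these assumptions, a perfectly coalesced half-warp of 4-byte reads shows up as a single 64-byte load, while a stride of 8 words costs four full 128-byte loads for the same 64 bytes of useful data.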

Question 4

Thanks for the detailed answer. Just another thought, though: assuming the memory access patterns are similar across all warps, if we directly divide the counter gld_coherent by the counter gld_request, do we get the number of (uncoalesced) memory transactions per warp?

CygnusX1 · Accepted Answer · 2012-03-18 12:52:58Z

