The CUDA profiler manual states that, because of the more relaxed coalescing policy, the number of uncoalesced memory transactions will always be zero. But I'm sure some accesses are still uncoalesced. How can I calculate how many? Are there any tools or simulators that can help, and which of them is the most accurate? Thanks.

asked Mar 18, 2012 at 12:04
  • What makes you certain there are uncoalesced memory transactions? Wouldn't it be easier to compare memory throughput of the kernel to a known benchmark like CUDA memcpy? Commented Mar 18, 2012 at 17:08

1 Answer

On devices of compute capability 1.0 and 1.1, you had only two options:

  • The memory access is coalesced and all data is fetched in one memory transaction.
  • The memory access is uncoalesced and data is fetched one word at a time, hence always 16 memory transactions (one per thread of the half-warp).

On devices of compute capability 1.2 and 1.3, however, this is done differently. Imagine your device memory divided into chunks of 128 bytes each. A half-warp needs as many memory transactions as the number of chunks it hits. So:

  • if you get perfectly coalesced access, you get 1 memory transaction
  • if the access is merely misaligned, you may get 2 memory transactions
  • if every thread accesses every n-th word, you can get 3, 4, or even more memory transactions
  • in the worst case you get 16 memory transactions
  • but even if the access is somewhat random yet localised, two threads may happen to fall into the same chunk and you will need fewer than 16 memory transactions
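The cases above can be checked with a small host-side model. This is only a sketch of the simplified 128-byte-chunk picture described here, not a hardware simulator; the helper names `count_transactions` and `half_warp_addresses` are made up for illustration, and 4-byte words are assumed.

```python
CHUNK_BYTES = 128   # chunk size used in the simplified model
HALF_WARP = 16      # threads serviced together on compute capability 1.x


def count_transactions(byte_addresses, chunk_bytes=CHUNK_BYTES):
    """Return how many distinct chunks the given byte addresses touch."""
    return len({addr // chunk_bytes for addr in byte_addresses})


def half_warp_addresses(base, stride_words, word_bytes=4):
    """Byte address read by each thread: base + tid * stride_words * word_bytes."""
    return [base + tid * stride_words * word_bytes for tid in range(HALF_WARP)]


coalesced = count_transactions(half_warp_addresses(0, 1))     # consecutive words
misaligned = count_transactions(half_warp_addresses(100, 1))  # straddles a chunk boundary
strided = count_transactions(half_warp_addresses(0, 8))       # every 8th word
worst = count_transactions(half_warp_addresses(0, 32))        # one chunk per thread

print(coalesced, misaligned, strided, worst)
```

Feeding the function the addresses your own kernel would generate for one half-warp gives an estimate of the transaction count under this model.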

There are so many cases that sorting accesses into just two categories, coalesced and uncoalesced, no longer makes sense. That is why the CUDA profiler went a different way: it simply counts the number of memory transactions. The more random your access pattern is, the higher the memory transaction count, even if the number of memory access instructions stays the same.

The above is a slightly simplified model. In reality, a memory transaction can access a 128-byte, 64-byte or 32-byte wide chunk, to save bandwidth. Look for the columns load 128b, load 64b, load 32b, and store 128b, store 64b, store 32b in your profiler.
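To see how the three transaction sizes could arise, here is a sketch that extends the chunk model with a size-reduction step. The rule used is an assumption based on my reading of the compute capability 1.2/1.3 description in the CUDA programming guide: start from the segment that contains the first pending thread's address, then halve the transaction while all requested bytes sit in one half. The function name `half_warp_transactions` is hypothetical, and addresses are assumed to be aligned to the word size.

```python
from collections import Counter


def half_warp_transactions(byte_addresses, word_bytes=4):
    """Return a list of (start, size_bytes) transactions for one half-warp.

    byte_addresses is in thread order, so the first pending entry belongs
    to the lowest-numbered thread that has not been served yet.
    """
    # Assumed initial segment size: 32 B for 1-byte words, 64 B for 2-byte, else 128 B.
    initial = {1: 32, 2: 64}.get(word_bytes, 128)
    pending = list(byte_addresses)
    transactions = []
    while pending:
        seg_start = (pending[0] // initial) * initial
        seg_end = seg_start + initial
        served = [a for a in pending if seg_start <= a < seg_end]
        lo, hi = min(served), max(served) + word_bytes - 1
        # Shrink the transaction while only one half of it is actually used.
        start, size = seg_start, initial
        while size > 32:
            half = size // 2
            if hi < start + half:        # only the lower half is used
                size = half
            elif lo >= start + half:     # only the upper half is used
                start, size = start + half, half
            else:
                break
        transactions.append((start, size))
        pending = [a for a in pending if not (seg_start <= a < seg_end)]
    return transactions


aligned = half_warp_transactions([4 * t for t in range(16)])
shifted = half_warp_transactions([100 + 4 * t for t in range(16)])
scattered = half_warp_transactions([128 * t for t in range(16)])

# Tally per size, analogous to the 128b / 64b / 32b profiler columns.
print(Counter(size for _, size in shifted))
```

Under these assumptions the aligned half-warp needs a single 64-byte transaction, the shifted one a 32-byte plus a 64-byte transaction, and the scattered one sixteen 32-byte transactions.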

Ashwin Nanjappa
answered Mar 18, 2012 at 12:52

1 Comment

Thanks for the detailed answer. Just another thought, though: assuming the memory access patterns are similar across all warps, if we divide the counter gld_coherent by the counter gld_request, do we get the number of (uncoalesced) memory transactions per warp?
