
I have a loop where I launch multiple kernels with interdependencies using events and streams.

Here’s the original loop without CUDA graphs:

for (int i = 1; i <= 1024; i++) {
    // origin stream
    kernel1<<<1, 512, 0, stream1>>>(i, /*params*/);
    // fork into stream 2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernel2<<<1, 512, 0, stream1>>>(i, /*params*/);
    kernel3<<<gridDim, blockDim, 0, stream2>>>(i, /*params*/);
    // join stream 2 back to origin
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
}

To decrease the overhead of multiple kernel launches, I chose to use CUDA graphs.

I have dynamic parameters for the kernels and was unsure how to capture inter-dependent streams with dynamic kernel parameters, so I posted a question about it.

Based on suggestions from comments, I captured the entire loop in a CUDA graph.

Here is the graph code with start and stop events included:

CUDA_CHECK(cudaEventRecord(start, stream1));
CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
for (int i = 1; i <= 1024; i++) {
    // origin stream
    kernel1<<<1, 512, 0, stream1>>>(i, /*params*/);
    // fork into stream 2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernel2<<<1, 512, 0, stream1>>>(i, /*params*/);
    kernel3<<<gridDim, blockDim, 0, stream2>>>(i, /*params*/);
    // join stream 2 back to origin
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
}
CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));
CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));
CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
// Record the stop event before synchronizing
CUDA_CHECK(cudaEventRecord(stop, stream1));
CUDA_CHECK(cudaEventSynchronize(stop));

However, this approach is taking much more time than the original loop. When I used Nsight Systems to profile the application, it showed that cudaEventSynchronize(stop) is taking a significant amount of time.

What could be causing this increased time, and how can I optimize the graph execution to reduce the synchronization time?

Trace of the CUDA Graph creation and launch (screenshot from Nsight Systems).

Update: Trace of the initial version without CUDA Graphs (screenshot from Nsight Systems).

asked Jun 20, 2024 at 15:55
  • Could you also include a trace of the initial version not using a CUDA Graph? Is it actually the execution of the graph that takes longer than the execution of the kernels previously or is it the creation and instantiation of the graph that causes the worse performance? The latter would be somewhat expected as CUDA Graphs only make sense when run (significantly) more than once. If that isn't given in your use-case, a smaller CUDA Graph of a single or few iterations might be favorable which brings us back to the previous question. Commented Jun 20, 2024 at 16:23
  • @paleonix I didn't understand your question. Before using CUDA Graphs, the total time recorded was around 100 ms; now it is about 180 ms, and the major contributor is cudaEventSynchronize. I have added the trace of the initial version without using a CUDA Graph. Commented Jun 20, 2024 at 16:36
  • Synchronization itself is not really the contributor; it is the waiting for the work on the GPU to complete. In the trace you can see how much time the actual computations on the GPU need. My question is whether that time is actually higher for some weird reason, or whether the performance difference is caused by overheads from CUDA Graph creation/instantiation. Commented Jun 20, 2024 at 16:40
  • @paleonix From both traces, CUDA Graph creation/instantiation is taking almost as much time as the sum of all three kernels, so yes, it is higher for some weird reason, I guess. Commented Jun 20, 2024 at 16:48
  • From the trace of the initial setup without CUDA Graphs, it looks like your use case is not restricted by launch overheads. They seem to be already hidden behind the computation. Cases that really profit from CUDA Graphs show significant breaks in GPU activity because new work isn't launched fast enough from the CPU. I guess kernel3 is big enough that you don't need CUDA Graphs. Commented Jun 20, 2024 at 16:49

1 Answer


This will only be a partial answer to your question.

Capture fewer operations, execute captured graph many times

The graph mechanism is only expected to provide a benefit over "direct" (i.e., graph-less) scheduling of CUDA work onto streams if you execute instances of the same graph multiple times.

In your example, you capture the many iterations of your loop and execute the graph just once. It is to be expected that, in this situation, the execution time would be worse with graphs. If there is to be any benefit from using CUDA graphs, it would come from capturing one (or a few) iterations of your loop and executing that graph many times (a sketch follows below). I wonder what the optimal number of iterations to capture is... it might be something other than just one. Perhaps the square root of the total number?
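For illustration, here is a minimal sketch of that idea, built on the code from the question (stream, event, kernel and graph variable names are taken from there). It assumes the kernels can be changed to read the iteration index from device memory; the small bump_iter helper kernel below is an invention of this sketch, not something from the original code.

// Hypothetical helper kernel: advances the iteration index the other kernels read.
// This replaces passing `i` by value, which would otherwise be frozen into the graph.
__global__ void bump_iter(int *iter) { ++*iter; }

int *d_iter;
CUDA_CHECK(cudaMalloc(&d_iter, sizeof(int)));
CUDA_CHECK(cudaMemset(d_iter, 0, sizeof(int)));   // first bump inside the graph makes it 1

// Capture a single iteration of the fork/join pattern.
CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
bump_iter<<<1, 1, 0, stream1>>>(d_iter);
kernel1<<<1, 512, 0, stream1>>>(d_iter /*, params */);
// fork into stream 2
cudaEventRecord(event1, stream1);
cudaStreamWaitEvent(stream2, event1, 0);
kernel2<<<1, 512, 0, stream1>>>(d_iter /*, params */);
kernel3<<<gridDim, blockDim, 0, stream2>>>(d_iter /*, params */);
// join stream 2 back to origin
cudaEventRecord(event2, stream2);
cudaStreamWaitEvent(stream1, event2, 0);
CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));
CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));

// Launch the same executable graph once per iteration. Only the launches are
// timed, so the one-off capture/instantiation cost stays out of the measurement.
CUDA_CHECK(cudaEventRecord(start, stream1));
for (int i = 1; i <= 1024; i++) {
    CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
}
CUDA_CHECK(cudaEventRecord(stop, stream1));
CUDA_CHECK(cudaEventSynchronize(stop));

Whether this actually wins anything depends on the workload; as noted in the comments, if kernel3 is large enough to hide the launch overhead, a graph may not help at all.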

All that being said, you report significantly worse execution time, and that I can't quite explain.
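One thing that may be worth checking, echoing the comment discussion: in the posted code the start event is recorded before capture even begins, so the start-to-stop interval also spans the host-side capture and instantiation work, not just graph execution. A small, hedged sketch (reusing the variable names from the question) that measures the two parts separately:

#include <chrono>   // includes shown for completeness
#include <cstdio>

auto t0 = std::chrono::steady_clock::now();
CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
/* ... same capture body as in the question ... */
CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));
CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));
auto t1 = std::chrono::steady_clock::now();
printf("capture + instantiate: %.3f ms\n",
       std::chrono::duration<double, std::milli>(t1 - t0).count());

// Record start only now, so the event pair brackets graph execution alone.
CUDA_CHECK(cudaEventRecord(start, stream1));
CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
CUDA_CHECK(cudaEventRecord(stop, stream1));
CUDA_CHECK(cudaEventSynchronize(stop));
float execMs = 0.0f;
CUDA_CHECK(cudaEventElapsedTime(&execMs, start, stop));
printf("graph execution:       %.3f ms\n", execMs);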

answered Jun 21, 2024 at 16:58

1 Comment

plain = no CUDA graphs. See edit.
