I have a loop in which I launch multiple interdependent kernels, using events and streams to express the dependencies.
Here’s the original loop without CUDA graphs:
for (int i = 1; i <= 1024; i++) {
    // origin stream
    kernel1<<<1, 512, 0, stream1>>>(i, /*params*/);
    // fork into stream 2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernel2<<<1, 512, 0, stream1>>>(i, /*params*/);
    kernel3<<<gridDim, blockDim, 0, stream2>>>(i, /*params*/);
    // join stream 2 back to origin
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
}
To reduce the overhead of these many kernel launches, I decided to use CUDA graphs.
The kernels take dynamic parameters, and I was unsure how to capture inter-dependent streams with dynamic kernel parameters, so I posted a question about it.
Based on suggestions from comments, I captured the entire loop in a CUDA graph.
Here is the graph code with start and stop events included:
CUDA_CHECK(cudaEventRecord(start, stream1));
CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
for (int i = 1; i <= 1024; i++) {
    // origin stream
    kernel1<<<1, 512, 0, stream1>>>(i, /*params*/);
    // fork into stream 2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernel2<<<1, 512, 0, stream1>>>(i, /*params*/);
    kernel3<<<gridDim, blockDim, 0, stream2>>>(i, /*params*/);
    // join stream 2 back to origin
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
}
CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));
CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));
CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
// Record the stop event before synchronizing
CUDA_CHECK(cudaEventRecord(stop, stream1));
CUDA_CHECK(cudaEventSynchronize(stop));
However, this approach takes much more time than the original loop. When I profiled the application with Nsight Systems, it showed that cudaEventSynchronize(stop) takes a significant amount of time.
What could be causing this increased time, and how can I optimize the graph execution to reduce the synchronization time?
[Nsight Systems screenshot: trace of the CUDA graph creation and launch]
Update: [Nsight Systems screenshot: trace of the initial version without CUDA graphs]
1 Answer
This will only be a partial answer to your question.
Capture fewer operations, execute captured graph many times
The graph mechanism only provides a benefit over "direct" (i.e., graph-less) scheduling of CUDA work onto streams if you execute instances of the same graph multiple times.
In your example, you capture all 1024 iterations of your loop and execute the resulting graph just once. It is to be expected that, in this situation, the execution time would be worse with graphs. If there is to be any benefit from using CUDA graphs, it would come from capturing one (or a few) iterations of your loop and executing that graph many times; a sketch of this pattern follows. I wonder what the optimal number of iterations to capture is... it might be something other than just one. Perhaps the square root of the total number?
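Here is a minimal, self-contained sketch of that pattern. The kernel bodies, the (int, float*) signature, the buffer buf, and the grid dimensions are stand-in assumptions, not your actual code. Since the loop index gets baked into the captured graph, the sketch patches the kernel-node parameters with cudaGraphExecKernelNodeSetParams before each launch (per the documentation, such updates only affect future launches of the executable graph, so no synchronization is needed between launches):

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernels (assumption): all take (int, float*).
__global__ void kernel1(int i, float* buf) { buf[threadIdx.x] += i; }
__global__ void kernel2(int i, float* buf) { buf[threadIdx.x] *= 2.0f; }
__global__ void kernel3(int i, float* buf) { buf[threadIdx.x] -= i; }

int main() {
    cudaStream_t stream1, stream2;
    cudaEvent_t event1, event2;
    CUDA_CHECK(cudaStreamCreate(&stream1));
    CUDA_CHECK(cudaStreamCreate(&stream2));
    CUDA_CHECK(cudaEventCreate(&event1));
    CUDA_CHECK(cudaEventCreate(&event2));

    float* buf;
    CUDA_CHECK(cudaMalloc(&buf, 512 * sizeof(float)));

    // Capture ONE iteration of the fork/join pattern.
    int i = 1;  // placeholder value; it gets baked into the graph
    CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
    kernel1<<<1, 512, 0, stream1>>>(i, buf);
    CUDA_CHECK(cudaEventRecord(event1, stream1));
    CUDA_CHECK(cudaStreamWaitEvent(stream2, event1, 0));
    kernel2<<<1, 512, 0, stream1>>>(i, buf);
    kernel3<<<1, 512, 0, stream2>>>(i, buf);
    CUDA_CHECK(cudaEventRecord(event2, stream2));
    CUDA_CHECK(cudaStreamWaitEvent(stream1, event2, 0));
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));

    cudaGraphExec_t graphExec;
    CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));

    // Enumerate the captured nodes so the per-launch parameter
    // update below can target the kernel nodes.
    size_t numNodes = 0;
    CUDA_CHECK(cudaGraphGetNodes(graph, NULL, &numNodes));
    std::vector<cudaGraphNode_t> nodes(numNodes);
    CUDA_CHECK(cudaGraphGetNodes(graph, nodes.data(), &numNodes));

    // Launch the same executable graph 1024 times, patching the
    // loop index into every kernel node before each launch.
    for (i = 1; i <= 1024; i++) {
        for (size_t n = 0; n < numNodes; n++) {
            cudaGraphNodeType type;
            CUDA_CHECK(cudaGraphNodeGetType(nodes[n], &type));
            if (type != cudaGraphNodeTypeKernel) continue;
            cudaKernelNodeParams p;
            CUDA_CHECK(cudaGraphKernelNodeGetParams(nodes[n], &p));
            // All stand-in kernels share one signature; with mixed
            // signatures you would match nodes via p.func instead.
            void* args[] = { &i, &buf };
            p.kernelParams = args;
            CUDA_CHECK(cudaGraphExecKernelNodeSetParams(graphExec, nodes[n], &p));
        }
        CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
    }
    CUDA_CHECK(cudaStreamSynchronize(stream1));

    CUDA_CHECK(cudaFree(buf));
    CUDA_CHECK(cudaGraphExecDestroy(graphExec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    return 0;
}

Note that the per-launch parameter patching adds host-side API calls of its own; if your dynamic parameters can instead live in device memory that the graph itself reads, the launch loop shrinks to just the cudaGraphLaunch calls.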
All that being said - you complain about significantly worse execution time, and that I can't quite explain.
Comments:
Regarding cudaEventSynchronize - I have added the trace of the initial version without using CUDA graphs.
kernel3 is big enough that you don't need CUDA Graphs.