I have a loop in which I launch multiple interdependent kernels, using events and streams to express the dependencies.
Here’s the original loop without CUDA graphs:
for (int i = 1; i <= 1024; i++) {
    // origin stream
    kernel1<<<1, 512, 0, stream1>>>(i, /*params*/);
    // fork into stream 2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernel2<<<1, 512, 0, stream1>>>(i, /*params*/);
    kernel3<<<gridDim, blockDim, 0, stream2>>>(i, /*params*/);
    // join stream 2 back to origin
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
}
To reduce the overhead of these many kernel launches, I decided to use CUDA graphs.
The kernels take dynamic parameters, and I was unsure how to capture inter-dependent streams with dynamic kernel parameters, so I posted a question about it.
Based on suggestions from comments, I captured the entire loop in a CUDA graph.
Here is the graph code with start and stop events included:
CUDA_CHECK(cudaEventRecord(start, stream1));
CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
for (int i = 1; i <= 1024; i++) {
    // origin stream
    kernel1<<<1, 512, 0, stream1>>>(i, /*params*/);
    // fork into stream 2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernel2<<<1, 512, 0, stream1>>>(i, /*params*/);
    kernel3<<<gridDim, blockDim, 0, stream2>>>(i, /*params*/);
    // join stream 2 back to origin
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
}
CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));
CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));
CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
// Record the stop event before synchronizing
CUDA_CHECK(cudaEventRecord(stop, stream1));
CUDA_CHECK(cudaEventSynchronize(stop));
However, this approach takes much more time than the original loop. When I profiled the application with Nsight Systems, it showed that cudaEventSynchronize(stop) takes a significant amount of time.
What could be causing this increased time, and how can I optimize the graph execution to reduce the synchronization time?
[Nsight Systems screenshot: trace of the CUDA graph creation and launch]
Update: [Nsight Systems screenshot: trace of the initial version without CUDA graphs]
1 Answer
This will only be a partial answer to your question.
Capture fewer operations, execute captured graph many times
The graph mechanism only provides a benefit over "direct" (i.e., graph-less) scheduling of CUDA work onto streams if you execute instances of the same graph multiple times.
In your example, you capture all 1024 iterations of your loop and execute the resulting graph just once. It is to be expected that, in this situation, the execution time would be worse with graphs. If there is to be any benefit from using CUDA graphs, it would come from capturing one (or a few) iterations of your loop and executing that graph many times; a sketch of this pattern follows. I wonder what the optimal number of iterations to capture is... it might be something other than just one. Perhaps the square root of the total number?
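Here is a minimal, self-contained sketch of that pattern. The kernel bodies, the (int, float*) signature, the buffer buf, and the grid dimensions are stand-in assumptions, not your actual code. Since the loop index gets baked into the captured graph, the sketch patches the kernel-node parameters with cudaGraphExecKernelNodeSetParams before each launch (per the documentation, such updates only affect future launches of the executable graph, so no synchronization is needed between launches):

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernels (assumption): all take (int, float*).
__global__ void kernel1(int i, float* buf) { buf[threadIdx.x] += i; }
__global__ void kernel2(int i, float* buf) { buf[threadIdx.x] *= 2.0f; }
__global__ void kernel3(int i, float* buf) { buf[threadIdx.x] -= i; }

int main() {
    cudaStream_t stream1, stream2;
    cudaEvent_t event1, event2;
    CUDA_CHECK(cudaStreamCreate(&stream1));
    CUDA_CHECK(cudaStreamCreate(&stream2));
    CUDA_CHECK(cudaEventCreate(&event1));
    CUDA_CHECK(cudaEventCreate(&event2));

    float* buf;
    CUDA_CHECK(cudaMalloc(&buf, 512 * sizeof(float)));

    // Capture ONE iteration of the fork/join pattern.
    int i = 1;  // placeholder value; it gets baked into the graph
    CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
    kernel1<<<1, 512, 0, stream1>>>(i, buf);
    CUDA_CHECK(cudaEventRecord(event1, stream1));
    CUDA_CHECK(cudaStreamWaitEvent(stream2, event1, 0));
    kernel2<<<1, 512, 0, stream1>>>(i, buf);
    kernel3<<<1, 512, 0, stream2>>>(i, buf);
    CUDA_CHECK(cudaEventRecord(event2, stream2));
    CUDA_CHECK(cudaStreamWaitEvent(stream1, event2, 0));
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));

    cudaGraphExec_t graphExec;
    CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));

    // Enumerate the captured nodes so the per-launch parameter
    // update below can target the kernel nodes.
    size_t numNodes = 0;
    CUDA_CHECK(cudaGraphGetNodes(graph, NULL, &numNodes));
    std::vector<cudaGraphNode_t> nodes(numNodes);
    CUDA_CHECK(cudaGraphGetNodes(graph, nodes.data(), &numNodes));

    // Launch the same executable graph 1024 times, patching the
    // loop index into every kernel node before each launch.
    for (i = 1; i <= 1024; i++) {
        for (size_t n = 0; n < numNodes; n++) {
            cudaGraphNodeType type;
            CUDA_CHECK(cudaGraphNodeGetType(nodes[n], &type));
            if (type != cudaGraphNodeTypeKernel) continue;
            cudaKernelNodeParams p;
            CUDA_CHECK(cudaGraphKernelNodeGetParams(nodes[n], &p));
            // All stand-in kernels share one signature; with mixed
            // signatures you would match nodes via p.func instead.
            void* args[] = { &i, &buf };
            p.kernelParams = args;
            CUDA_CHECK(cudaGraphExecKernelNodeSetParams(graphExec, nodes[n], &p));
        }
        CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
    }
    CUDA_CHECK(cudaStreamSynchronize(stream1));

    CUDA_CHECK(cudaFree(buf));
    CUDA_CHECK(cudaGraphExecDestroy(graphExec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    return 0;
}

Note that the per-launch parameter patching adds host-side API calls of its own; if your dynamic parameters can instead live in device memory that the graph itself reads, the launch loop shrinks to just the cudaGraphLaunch calls.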
All that being said - you complain about significantly worse execution time, and that I can't quite explain.
Comments:
Regarding cudaEventSynchronize - I have added the trace of the initial version without using CUDA graphs.
kernel3 is big enough that you don't need CUDA Graphs.