23 questions
2 votes, 1 answer, 199 views
Executing a CUDA Graph from a CUDA kernel
I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch).
From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...
0 votes, 1 answer, 204 views
Trying to end stream capture fails due to "unjoined work"; but synchronizing fails when capture is in progress
I am capturing some work on a CUDA stream, then instantiating the resulting graph (or graph template rather) and running it again. I am running CUDA 12.6.85 with driver version 535.54.03 (the driver ...
0 votes, 2 answers, 308 views
CUDA graph cudaKernelNodeParams kernelParams
I want to use CUDA Graphs in my CUDA project, but there aren’t many complete examples online. So I implemented it directly from the official API reference, but I keep encountering a segmentation fault. ...
1 vote, 1 answer, 770 views
CUDA Graph Execution Taking Longer Than Original Kernel Launch Loop
I have a loop where I launch multiple kernels with interdependencies using events and streams.
Here’s the original loop without CUDA graphs:
for (int i = 1; i <= 1024; i++) {
    // origin stream
...
2 votes, 2 answers, 879 views
How to Use CUDA Graphs with Interdependent Streams and Dynamic Parameters?
I have a CUDA program with multiple interdependent streams, and I want to convert it to use CUDA graphs to reduce launch overhead and improve performance. My program involves launching three kernels (...
0 votes, 1 answer, 163 views
Behavior of cudaGraphInstantiateFlagUseNodePriority
My understanding of cudaGraphInstantiateFlagUseNodePriority is that it prioritizes kernel launches.
That is, suppose a CUDA graph contains three independent kernels (first, second, and third), and each kernel waits ...
0 votes, 1 answer, 788 views
Is it possible to execute more than one CUDA graph's host execution node in different streams concurrently?
Investigating possible solutions for this problem, I thought about using CUDA graphs' host execution nodes (cudaGraphAddHostNode). I was hoping to have the option to block and unblock streams on the ...
3 votes, 1 answer, 593 views
Catching an exception thrown from a callback in cudaLaunchHostFunc
I want to check for an error flag living in managed memory that might have been written by a kernel running on a certain stream. Depending on the error flag I need to throw an exception.
I would ...
0 votes, 1 answer, 115 views
What should I set the flags field of CUDA_BATCH_MEM_OP_NODE_PARAMS to?
The CUDA graph API exposes a function call for adding a "batch memory operations" node to a graph:
CUresult cuGraphAddBatchMemOpNode(
    CUgraphNode* phGraphNode,
    CUgraph hGraph,
    ...
0 votes, 1 answer, 59 views
What type should the result pointer of cuDeviceGetGraphMemAttribute() point to?
cuDeviceGetGraphMemAttribute() takes a void pointer to a result variable. But what type does it expect the pointed-to value to be? The documentation (for CUDA v12.0) doesn't say. I'm guessing it's ...
0 votes, 1 answer, 63 views
How can I tell whether a copy-node search failed, or whether my node or graph is invalid?
Consider the CUDA graphs API function cuFindNodeInClone(). The documentation says that it:
Returns:
CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE
This seems problematic to me. How can I tell whether the ...
-1 votes, 1 answer, 489 views
CUDA graph does not run as expected
I'm using the following code to learn how to use "CUDA graphs". The parameter NSTEP is set to 1000, and the parameter NKERNEL is set to 20. The kernel function shortKernel has ...
0 votes, 1 answer, 646 views
Simple CUDA graph example doesn't produce expected result
I am testing out CUDA graphs. My graph is as follows.
The code for this is as follows:
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <...
0 votes, 1 answer, 3k views
Error with a captured CUDA graph and asynchronous memory allocations in a loop [closed]
I am trying to implement a CUDA graph experiment. There are three kernels, kernel_0, kernel_1, and kernel_2. They will be executed sequentially and have dependencies. Right now I am going to only ...
2 votes, 2 answers, 2k views
Using multiple streams in a CUDA graph, the execution order is uncontrolled
I am using the CUDA graph stream capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide here, I wrote the complete code. As far as I know, kernelB should execute ...