3,752 questions
-5 votes · 1 answer · 56 views
Performance Degradation of LAMMPS with Increased MPI Ranks on an A100 GPU [closed]
I tested the performance of LAMMPS with DeepMD-kit for MD simulations on an HPC cluster.
The job was allocated 8 CPUs, 64 GB of RAM, and one A100 GPU.
I observed that when running with mpirun -np 1 ...
Tooling
0 votes · 1 answer · 60 views
How to dynamically estimate maximum number of cameras my GPU can handle for YOLOv8 inference?
I’m trying to simulate multiple camera streams feeding into a YOLOv8l model on a single GPU and monitor real-time hardware utilization. My setup:
Single GPU (48GB VRAM, CUDA-enabled)
YOLOv8l model
...
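There is no API that reports a hard "maximum number of cameras"; the practical approach is to ramp up simulated streams while polling the GPU. A minimal sketch of that loop follows, assuming the ultralytics package and NVML via pynvml; the weights file, dummy frame, and saturation thresholds are placeholders rather than anything from the question.

```python
# Minimal sketch: add simulated streams one at a time and watch GPU
# utilization / VRAM with pynvml (NVML). Thresholds, stream count, and
# the dummy-frame source are placeholders, not from the question.
import numpy as np
import pynvml
from ultralytics import YOLO

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
model = YOLO("yolov8l.pt")  # assumed weights file

def gpu_stats():
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % over last sample window
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
    return util.gpu, mem.used / mem.total * 100

frame = np.zeros((640, 640, 3), dtype=np.uint8)  # stand-in for one camera frame

for n_streams in range(1, 65):
    # one inference per simulated stream; a real test would run them concurrently
    for _ in range(n_streams):
        model(frame, verbose=False)
    sm, vram = gpu_stats()
    print(f"{n_streams} streams -> SM {sm}%  VRAM {vram:.0f}%")
    if sm > 95 or vram > 90:  # arbitrary cut-off for "saturated"
        print(f"Estimated capacity: about {n_streams - 1} streams")
        break

pynvml.nvmlShutdown()
```

In a real test the streams would run concurrently (threads or batched inference), so a sequential loop like this only gives a rough lower bound on capacity.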
1 vote · 1 answer · 476 views
How to correctly install JAX with CUDA on Linux when `jax[cuda12_pip]` consistently falls back to the CPU version?
I am trying to install JAX with GPU support on a powerful, dedicated Linux server, but I am stuck in what feels like a Catch-22 where every official installation method fails in a different way, ...
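A quick way to tell whether such an install actually produced a CUDA-enabled jaxlib is to ask JAX which backend it loaded. A minimal check, assuming the currently documented `pip install -U "jax[cuda12]"` extra (the `cuda12_pip` spelling in the title is an older form):

```python
# Minimal post-install check, assuming `pip install -U "jax[cuda12]"` was used.
# If this still prints "cpu", the CUDA wheel did not install or cannot see the driver.
import jax

print(jax.__version__)
print(jax.default_backend())   # expected: 'gpu'
print(jax.devices())           # expected: [CudaDevice(id=0), ...]
```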
3 votes · 1 answer · 156 views
Unable to run a CUDA program in Google Colab
I am trying to run a basic CUDA program in Google Colab, but it's not giving kernel output.
Below are the steps I tried:
Changed runtime type to T4 GPU.
!pip install nvcc4jupyter
%load_ext ...
1 vote · 1 answer · 66 views
How to debug CUDA in Visual Studio with "step over"
I installed NVIDIA Nsight Visual Studio Edition 2025.01 in Visual Studio 2022.
I want to debug code, but I can't step over (F10); the debugger always stops at a location without a breakpoint....
0 votes · 1 answer · 465 views
TensorFlow not detecting NVIDIA GPU (RTX 3050, CUDA 12.7, TF 2.20.0) [duplicate]
I’ve been trying to get TensorFlow to use my GPU on Windows, and even though everything seems installed correctly, it shows 0 available GPUs.
System setup
Windows 11
RTX 3050 Laptop GPU
NVIDIA driver ...
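The usual culprit for this combination is that native Windows builds of TensorFlow stopped shipping GPU support after 2.10, so TF 2.20 on Windows 11 only sees the GPU from inside WSL2. A minimal diagnostic, assuming the setup described above:

```python
# Quick diagnostic, assuming TensorFlow 2.20 on Windows 11 as in the question.
# Native Windows builds of TensorFlow stopped shipping GPU support after 2.10,
# so on Windows the GPU is typically only visible from inside WSL2.
import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))   # [] on a native Windows install >= 2.11
print(tf.test.is_built_with_cuda())             # False for the plain Windows pip wheel
```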
1 vote · 0 answers · 216 views
Why does "Command Buffer Full" appear in PyTorch CUDA kernel launches?
I’m using the PyTorch profiler to analyze sglang, and I noticed that in the CUDA timeline, some kernels show "Command Buffer Full". This causes the cudaLaunchKernel time to become very long, as shown ...
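For readers unfamiliar with where that annotation shows up: it appears on the CUDA timeline of a trace captured with torch.profiler. A minimal capture sketch is below; the matmul loop is a placeholder workload, not sglang.

```python
# Minimal profiling sketch for reproducing such a trace; the workload here is a
# placeholder matmul loop, not sglang. Open trace.json in chrome://tracing or
# Perfetto to inspect cudaLaunchKernel durations on the CUDA timeline.
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4096, 4096, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        x = x @ x
    torch.cuda.synchronize()

prof.export_chrome_trace("trace.json")
```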
2 votes · 0 answers · 326 views
jax plugin configuration error: Exception when calling jax_plugins.xla_cuda12.initialize()
I am using WSL2 on Windows 10 and have an NVIDIA graphics card. I recently installed GPU JAX using the command pip install -U "jax[cuda12]". This completed successfully, but when I run any jax ...
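With the cuda12 plugin wheels, initialization failures like this are usually driver-visibility problems inside WSL2 rather than the pip install itself. A small diagnostic sketch, assuming a recent JAX that provides jax.print_environment_info() and the Windows driver's WSL2 CUDA support:

```python
# Diagnostic sketch for a plugin initialization failure on WSL2 (assumptions:
# jax installed via `pip install -U "jax[cuda12]"`, NVIDIA driver exposed to
# WSL2 through the Windows driver's CUDA support).
import subprocess
import jax

# Dumps jax/jaxlib versions, the detected backend, and nvidia-smi output.
jax.print_environment_info()

# The driver must be visible inside WSL2 itself, not just on the Windows host.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```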
2 votes · 1 answer · 197 views
Executing a CUDA Graph from a CUDA kernel
I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch).
From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...
0 votes · 1 answer · 103 views
CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop
I am trying to implement a producer-consumer pattern between the GPU and CPU, which is required for another project. The GPU requests some data from the CPU via unified memory, and the CPU copies that data to a specific location in global ...
3 votes · 1 answer · 125 views
TensorRT PWC-Net Causing 2.4km Trajectory Error in iSLAM - Original PyTorch Works Fine
Problem Statement
My iSLAM system works correctly with the original PyTorch PWC-Net but produces catastrophic trajectory errors (2.4km ATE RMSE) when I replace it with a TensorRT-converted version. ...
0 votes · 0 answers · 153 views
TensorRT DLA Engine Build Fails for PWC-Net on Jetson NX - Missing Layer Support?
I'm converting a PWC-Net optical flow model to run on Jetson NX DLA using the iSLAM framework, but the TensorRT engine build fails during DLA optimization.
Environment
Hardware: NVIDIA Jetson NX
...
0 votes · 0 answers · 73 views
Using a scalar tensor as image source for a Holoscan HolovizOp
I am attempting to write my own holoscan::Operator for creating some images that should be displayed as a short video using a holoscan::ops::HolovizOp.
So I compose()-d an application flow: add_flow(...
0 votes · 1 answer · 170 views
How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)
I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing.
My approach so far:
Compute the theoretical ...
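For the in-program half of that comparison, effective bandwidth is just bytes moved divided by kernel time, which can then be expressed as a fraction of a theoretical peak derived from device properties. A minimal sketch using CuPy as a stand-in for the actual program (the DDR factor of 2 is an approximation, and the DCGM/DRAM_ACTIVE side is not shown):

```python
# Minimal sketch of in-program effective bandwidth: time a memory-bound copy,
# divide bytes moved by elapsed time, and compare against a theoretical peak
# derived from device properties. The DDR factor of 2 is an approximation.
import cupy as cp

props = cp.cuda.runtime.getDeviceProperties(0)
peak_gbps = props["memoryClockRate"] * 1e3 * (props["memoryBusWidth"] / 8) * 2 / 1e9

n = 1 << 28                               # 256M float32 elements (~1 GiB)
src = cp.ones(n, dtype=cp.float32)
dst = cp.empty_like(src)

start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()
cp.copyto(dst, src)                       # memory-bound: reads src, writes dst
end.record()
end.synchronize()

ms = cp.cuda.get_elapsed_time(start, end)
eff_gbps = 2 * src.nbytes / (ms / 1e3) / 1e9   # read + write traffic
print(f"effective {eff_gbps:.0f} GB/s of ~{peak_gbps:.0f} GB/s peak "
      f"({eff_gbps / peak_gbps:.0%})")
```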
2 votes · 1 answer · 157 views
How to define "pool" for Nvidia holoscan::ops::FormatConverterOp
I am trying to get the holoscan example "bring your own model"
https://docs.nvidia.com/holoscan/sdk-user-guide/examples/byom.html
to run, translating it from Python into C++.
One necessary ...