Extending dpctl to support CUDA #1124

Answered by diptorupd
diptorupd asked this question in Ideas

oneAPI 2023.0 supports CUDA devices using the "oneAPI for NVIDIA GPUs 2023.0" plugin. I am starting this exploratory discussion to evaluate the requirements and scope of work to support CUDA in dpctl via the oneAPI plugin.

Here are the findings from my initial exploration:

System information:

OS: Ubuntu 22.04 Jammy
CUDA GPU: NVIDIA GeForce GTX 1660 Ti
CUDA Toolkit: 11.4
CUDA Driver Version: 470.161.03

Initial setup steps:

a) Installed oneAPI following the installation guide

NOTE: Watch out for installation issues on Ubuntu 22.04 (e.g., cstddef.h not found). To work around them, run sudo apt install libstdc++-12-dev.

b) I already had CUDA set up, having followed the CUDA installation guide for my OS.

c) Downloaded the oneAPI for NVIDIA GPUs plugin and followed the installation guide

NOTE: If you have multiple types of devices on the system (I have an OpenCL GPU driver and a Level Zero GPU driver for a Gen9 integrated GPU, an OpenCL driver for the host CPU, and CUDA), you can compile the simple-sycl-app.cpp from the get-started guide with multiple -fsycl-targets, e.g., -fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown. Once that is done, you can execute simple-sycl-app for both CUDA and the other devices simply by changing SYCL_DEVICE_FILTER.
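
For concreteness, a compile-and-run sketch of that workflow; the exact clang++ invocation is my assumption rather than a copy from the guide:

$ clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown \
      simple-sycl-app.cpp -o simple-sycl-app
$ SYCL_DEVICE_FILTER=cuda ./simple-sycl-app          # run on the CUDA GPU
$ SYCL_DEVICE_FILTER=level_zero ./simple-sycl-app    # run on the Level Zero GPU
$ SYCL_DEVICE_FILTER=opencl:cpu ./simple-sycl-app    # run on the OpenCL CPU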

Building dpctl with CUDA

a) Build dpctl with the customized oneAPI. The process for me was just to run python scripts/build_locally.py.

NOTE: Be sure to remove the dpcpp-cpp-rt and dpcpp_linux-64 conda packages if you are building inside a conda environment.
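
A minimal sketch of the whole build step under those assumptions (package names as in the note above):

$ conda remove dpcpp-cpp-rt dpcpp_linux-64    # only when building inside a conda env
$ python scripts/build_locally.py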

Testing the install

a) After building and installing dpctl using the build_locally.py script, I tried the following:

>>> import dpctl
>>> dpctl.lsplatform()
Intel(R) FPGA Emulation Platform for OpenCL(TM) OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
Intel(R) OpenCL OpenCL 3.0 LINUX
Intel(R) OpenCL HD Graphics OpenCL 3.0 
Intel(R) Level-Zero 1.3
NVIDIA CUDA BACKEND CUDA 11.4

So far so good, the CUDA GPU is detected as expected.

b) Creating a queue on the CUDA device (the SYCL analogue of a CUDA stream):

>>> q = dpctl.SyclQueue("cuda")
>>> q.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f9e9287bf70>
>>> q.sycl_device.print_device_info()
 Name NVIDIA GeForce GTX 1660 Ti
 Driver version CUDA 11.4
 Vendor NVIDIA Corporation
 Filter string cuda:gpu:0
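
The filter string reported above can also be used to construct the device and queue explicitly; a sketch (to the best of my knowledge dpctl.get_devices accepts a backend filter, but treat that as an assumption):

>>> dev = dpctl.SyclDevice("cuda:gpu:0")     # filter string from the output above
>>> q = dpctl.SyclQueue(dev)
>>> cuda_devs = dpctl.get_devices(backend="cuda")   # enumerate CUDA devices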

c) Try a basic tensor creation:

>>> import dpctl.tensor as dpt
>>> a = dpt.empty(10, device="cuda")
>>> a.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f9e92fc5e70>
>>> print(a)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> a
usm_ndarray([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> a.usm_type
'device'
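
To further sanity-check the allocation, one could round-trip the data through host memory; a sketch using the documented copy helpers dpt.asnumpy and dpt.asarray (I have not run this on the CUDA device yet):

>>> import numpy as np
>>> host = dpt.asnumpy(a)                             # device USM -> NumPy copy
>>> b = dpt.asarray(np.arange(10.0), device="cuda")   # host -> device copy
>>> host2 = dpt.asnumpy(b)                            # and back again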

Initial thoughts

I got much farther than I had hoped. The plugin seamlessly exposed the CUDA device, queue creation works, and even memory allocation seems to have succeeded.

The next step will be to test some basic operations on the tensor. @oleksandr-pavlyk, can you suggest something? That said, I doubt it will work out of the box; I think we will need to build dpctl with -fsycl-targets=nvptx64-nvidia-cuda.

Replies: 3 comments 2 replies

@diptorupd Try dev = "cuda"; a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b];

For dev="gpu" I get:


In [4]: dev = "gpu"
In [5]: a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
In [6]: c
Out[6]: usm_ndarray([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])

@oleksandr-pavlyk As expected, running a kernel with a default compiled dpctl as-is will not work:

>>> a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/home/diptorupd/Desktop/devel/dpctl/dpctl/tensor/_ctors.py", line 642, in arange
 hev, _ = ti._linspace_step(_start, _step, res, sycl_queue)
RuntimeError: Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY) -42 (PI_ERROR_INVALID_BINARY)

However, after the following small patch

Author: Diptorup Deb <diptorup.deb@intel.com> 2023-03-14 23:47:18
Committer: Diptorup Deb <diptorup.deb@intel.com> 2023-03-14 23:47:18
Parent: 8f828f24ada9829ed4d9d5dc56e6d7f39dd9ac3c (Merge pull request #1118 from IntelPython/fix-build-break)
Branch: demo/cuda-support
Follows: 0.14.2
Precedes: 
 Compile with cuda support
diff --git a/dpctl/CMakeLists.txt b/dpctl/CMakeLists.txt
index 6ccca33dd..f8c08f105 100644
@@ -58,6 +58,7 @@ elseif(UNIX)
 "${WARNING_FLAGS}"
 "${SDL_FLAGS}"
 "-fsycl "
+ "-fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown "
 )
 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3 ${CFLAGS}")
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 ${CXXFLAGS}")
diff --git a/scripts/build_locally.py b/scripts/build_locally.py
index ff34c9d18..9c689ead9 100644
@@ -145,7 +145,7 @@ if __name__ == "__main__":
 and args.compiler_root is None
 ):
 args.c_compiler = "icx"
- args.cxx_compiler = "icpx" if "linux" in sys.platform else "icx"
+ args.cxx_compiler = "clang++" if "linux" in sys.platform else "icx"
 args.compiler_root = None
 else:
 cr = args.compiler_root
@@ -153,7 +153,9 @@ if __name__ == "__main__":
 if args.c_compiler is None:
 args.c_compiler = "icx"
 if args.cxx_compiler is None:
- args.cxx_compiler = "icpx" if "linux" in sys.platform else "icx"
+ args.cxx_compiler = (
+ "clang++" if "linux" in sys.platform else "icx"
+ )
 else:
 raise RuntimeError(
 "Option 'compiler-root' must be provided when "

There were a few warnings of the kind: clang++: warning: linked binaries do not contain expected 'nvptx64-nvidia-cuda' target; found targets: 'nvptx64-nvidia-cuda-sm_50, spir64-unknown-unknown' [-Wsycl-target], but we have CUDA support 😄
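
The sm_50 suffix in the warning comes from the CUDA target's default architecture. If one wanted to silence it by pinning the PTX target to the card's actual architecture (sm_75 for a GTX 1660 Ti), the flags would look roughly like the following, going by the oneAPI for NVIDIA GPUs documentation (an assumption, not something I verified in this build):

-fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_75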

>>> import dpctl
>>> import dpctl.tensor as dpt
>>> dev = "cuda"
>>> a = dpt.arange(30, device=dev)
>>> a
usm_ndarray([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
>>> print(a)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]
>>> a.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f2e89f698b0>
>>> b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
>>> c
usm_ndarray([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])
>>> c.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f2e89f698b0>

We should now spin up a Docker image with dpctl+CUDA and put it on the repo.

(But for now let me go back on vacation 😉 )

Great news! Enjoy your time-off!

Answer selected by diptorupd

With #1411, one can build dpctl using

$ DPCTL_TARGET_CUDA=1 python scripts/build_locally.py --verbose

This creates a fat binary with SPIR-V and PTX offload sections. The test suite passes using the CUDA backend. Since the GPU at my disposal is weak (a GT 1030), I must run each test file individually (exiting with status 255 makes xargs stop at the first failing file):

$ ONEAPI_DEVICE_SELECTOR=cuda:gpu find dpctl/tests/ -name "test_*.py" | xargs -n 1 bash -c 'python -m pytest 0ドル --durations=3 || exit 255'

With a beefier GPU, running the test suite works out of the box:

$ ONEAPI_DEVICE_SELECTOR=cuda:gpu pytest --pyargs dpctl

@ogrisel @fcharras
