Extending dpctl to support CUDA #1124

Answered by diptorupd
diptorupd asked this question in Ideas

oneAPI 2023.0 supports CUDA devices using the "oneAPI for NVIDIA GPUs 2023.0" plugin. I am starting this exploratory discussion to evaluate the requirements and scope of work to support CUDA in dpctl via the oneAPI plugin.

Here are the findings from my initial exploration:

System information:

OS: Ubuntu 22.04 Jammy
CUDA GPU: NVIDIA GeForce GTX 1660 Ti
CUDA Toolkit: 11.4
CUDA Driver Version: 470.161.03

Initial setup steps:

a) Installed oneAPI following the installation guide

NOTE: Watch out for installation issues on Ubuntu 22.04 (e.g., cstddef.h not found). To work around them, run sudo apt install libstdc++-12-dev.

b) I already had CUDA set up, having followed the CUDA installation guide for my OS.

c) Downloaded the oneAPI for NVIDIA GPUs plugin and followed the installation guide

NOTE: If you have multiple types of devices on the system (I have an OpenCL GPU driver and a Level Zero GPU driver for a Gen9 integrated GPU, an OpenCL driver for the host CPU, and CUDA), you can compile the simple-sycl-app.cpp from the get-started guide with multiple -fsycl-targets, e.g., -fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown. Once that is done, you can execute simple-sycl-app for both CUDA and the other devices simply by changing SYCL_DEVICE_FILTER.
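
For concreteness, a compile-and-run sketch of that workflow; the exact clang++ invocation is my assumption rather than a copy from the guide:

$ clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown \
      simple-sycl-app.cpp -o simple-sycl-app
$ SYCL_DEVICE_FILTER=cuda ./simple-sycl-app          # run on the CUDA GPU
$ SYCL_DEVICE_FILTER=level_zero ./simple-sycl-app    # run on the Level Zero GPU
$ SYCL_DEVICE_FILTER=opencl:cpu ./simple-sycl-app    # run on the OpenCL CPU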

Building dpctl with CUDA

a) Build dpctl with the customized oneAPI. The process for me was just to run python scripts/build_locally.py.

NOTE: Be sure to remove the dpcpp-cpp-rt and dpcpp_linux-64 conda packages if you are building inside a conda environment.
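
A minimal sketch of the whole build step under those assumptions (package names as in the note above):

$ conda remove dpcpp-cpp-rt dpcpp_linux-64    # only when building inside a conda env
$ python scripts/build_locally.py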

Testing the install

a) After building and installing dpctl using the build_locally.py script, I tried the following:

>>> import dpctl
>>> dpctl.lsplatform()
Intel(R) FPGA Emulation Platform for OpenCL(TM) OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
Intel(R) OpenCL OpenCL 3.0 LINUX
Intel(R) OpenCL HD Graphics OpenCL 3.0 
Intel(R) Level-Zero 1.3
NVIDIA CUDA BACKEND CUDA 11.4

So far so good, the CUDA GPU is detected as expected.

b) Creating a queue on the CUDA device (the SYCL analogue of a CUDA stream):

>>> q = dpctl.SyclQueue("cuda")
>>> q.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f9e9287bf70>
>>> q.sycl_device.print_device_info()
 Name NVIDIA GeForce GTX 1660 Ti
 Driver version CUDA 11.4
 Vendor NVIDIA Corporation
 Filter string cuda:gpu:0
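
The filter string reported above can also be used to construct the device and queue explicitly; a sketch (to the best of my knowledge dpctl.get_devices accepts a backend filter, but treat that as an assumption):

>>> dev = dpctl.SyclDevice("cuda:gpu:0")     # filter string from the output above
>>> q = dpctl.SyclQueue(dev)
>>> cuda_devs = dpctl.get_devices(backend="cuda")   # enumerate CUDA devices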

c) Try a basic tensor creation:

>>> import dpctl.tensor as dpt
>>> a = dpt.empty(10, device="cuda")
>>> a.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f9e92fc5e70>
>>> print(a)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> a
usm_ndarray([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> a.usm_type
'device'
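
To further sanity-check the allocation, one could round-trip the data through host memory; a sketch using the documented copy helpers dpt.asnumpy and dpt.asarray (I have not run this on the CUDA device yet):

>>> import numpy as np
>>> host = dpt.asnumpy(a)                             # device USM -> NumPy copy
>>> b = dpt.asarray(np.arange(10.0), device="cuda")   # host -> device copy
>>> host2 = dpt.asnumpy(b)                            # and back again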

Initial thoughts

I got much farther than I had hoped. The plugin seamlessly exposed the CUDA device, queue creation works, and even memory allocation seems to have succeeded.

The next step will be to test some basic operations on the tensor. @oleksandr-pavlyk, can you suggest something? That said, I doubt it will work out of the box; I think we will need to build dpctl with -fsycl-targets=nvptx64-nvidia-cuda.

Replies: 3 comments 2 replies

@diptorupd Try dev = "cuda"; a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b];

For dev="gpu" I get:


In [4]: dev = "gpu"
In [5]: a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
In [6]: c
Out[6]: usm_ndarray([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])

@oleksandr-pavlyk As expected, running a kernel with a default compiled dpctl as-is will not work:

>>> a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/home/diptorupd/Desktop/devel/dpctl/dpctl/tensor/_ctors.py", line 642, in arange
 hev, _ = ti._linspace_step(_start, _step, res, sycl_queue)
RuntimeError: Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY) -42 (PI_ERROR_INVALID_BINARY)

However, after the following small patch

Author: Diptorup Deb <diptorup.deb@intel.com> 2023-03-14 23:47:18
Committer: Diptorup Deb <diptorup.deb@intel.com> 2023-03-14 23:47:18
Parent: 8f828f24ada9829ed4d9d5dc56e6d7f39dd9ac3c (Merge pull request #1118 from IntelPython/fix-build-break)
Branch: demo/cuda-support
Follows: 0.14.2
Precedes: 
 Compile with cuda support
diff --git a/dpctl/CMakeLists.txt b/dpctl/CMakeLists.txt
index 6ccca33dd..f8c08f105 100644
@@ -58,6 +58,7 @@ elseif(UNIX)
 "${WARNING_FLAGS}"
 "${SDL_FLAGS}"
 "-fsycl "
+ "-fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown "
 )
 set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3 ${CFLAGS}")
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 ${CXXFLAGS}")
diff --git a/scripts/build_locally.py b/scripts/build_locally.py
index ff34c9d18..9c689ead9 100644
@@ -145,7 +145,7 @@ if __name__ == "__main__":
 and args.compiler_root is None
 ):
 args.c_compiler = "icx"
- args.cxx_compiler = "icpx" if "linux" in sys.platform else "icx"
+ args.cxx_compiler = "clang++" if "linux" in sys.platform else "icx"
 args.compiler_root = None
 else:
 cr = args.compiler_root
@@ -153,7 +153,9 @@ if __name__ == "__main__":
 if args.c_compiler is None:
 args.c_compiler = "icx"
 if args.cxx_compiler is None:
- args.cxx_compiler = "icpx" if "linux" in sys.platform else "icx"
+ args.cxx_compiler = (
+ "clang++" if "linux" in sys.platform else "icx"
+ )
 else:
 raise RuntimeError(
 "Option 'compiler-root' must be provided when "

There were a few warnings of the kind: clang++: warning: linked binaries do not contain expected 'nvptx64-nvidia-cuda' target; found targets: 'nvptx64-nvidia-cuda-sm_50, spir64-unknown-unknown' [-Wsycl-target], but we have CUDA support 😄
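
The sm_50 suffix in the warning comes from the CUDA target's default architecture. If one wanted to silence it by pinning the PTX target to the card's actual architecture (sm_75 for a GTX 1660 Ti), the flags would look roughly like the following, going by the oneAPI for NVIDIA GPUs documentation (an assumption, not something I verified in this build):

-fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_75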

>>> import dpctl
>>> import dpctl.tensor as dpt
>>> dev = "cuda"
>>> a = dpt.arange(30, device=dev)
>>> a
usm_ndarray([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
>>> print(a)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]
>>> a.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f2e89f698b0>
>>> b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
>>> c
usm_ndarray([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])
>>> c.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu, NVIDIA GeForce GTX 1660 Ti] at 0x7f2e89f698b0>

We should now spin up a Docker image with dpctl+CUDA and put it on the repo.

(But for now let me go back on vacation 😉 )

Great news! Enjoy your time-off!

Answer selected by diptorupd

With #1411, one can build dpctl using

$ DPCTL_TARGET_CUDA=1 python scripts/build_locally.py --verbose

This creates a fat binary with SPIR-V and PTX offload sections. The test suite passes using the CUDA backend. Since the GPU at my disposal is weak (a GT 1030), I must run each test file individually (exiting with status 255 makes xargs stop at the first failing file):

$ ONEAPI_DEVICE_SELECTOR=cuda:gpu find dpctl/tests/ -name "test_*.py" | xargs -n 1 bash -c 'python -m pytest 0ドル --durations=3 || exit 255'

With a beefier GPU, running the test suite works out of the box:

$ ONEAPI_DEVICE_SELECTOR=cuda:gpu pytest --pyargs dpctl

@ogrisel @fcharras
