[sycl_ext_oneapi_clock] implement NVPTX case#21280

Open

tdavidcl wants to merge 4 commits intointel:sycl from

tdavidcl:ptx-clock

Open

[sycl_ext_oneapi_clock] implement NVPTX case #21280
tdavidcl wants to merge 4 commits intointel:sycl from
tdavidcl:ptx-clock

Conversation

@tdavidcl

Copy link

@tdavidcl tdavidcl commented Feb 12, 2026 •

edited

Loading

Hi after suggestion from @zjin-lcf here is a PR (context: KhronosGroup/SYCL-Docs#958).
It implements the NVPTX variant of clock() using the %%clock64 special register from PTX.

https://docs.nvidia.com/cuda/archive/10.1/parallel-thread-execution/index.html?utm_source=chatgpt.com#special-registers-clock64

PTX ISA Notes
Introduced in PTX ISA version 2.0.

So it is safe to assume that the register is supported regardless of the PTX version used since intel llvm assume >5.0 if I recall correctly.

reference for usage internally to llvm (on this repo actually, nice :) )

llvm/clang/lib/Headers/__clang_cuda_device_functions.h

Line 1558 in 2705b67

__DEVICE__ long long clock64() { return __nvvm_read_ptx_sreg_clock64(); }

(there is a typo in the PR which is already corrected by a commit, but i don't why it is not updating in the PR ...)

@tdavidcl


 [sycl_ext_oneapi_clock] implement NVPTX case

f6ed79d

@tdavidcl tdavidcl requested a review from a team as a code owner

February 12, 2026 19:16

@tdavidcl tdavidcl requested a review from cperkinsintel

February 12, 2026 19:16

@tdavidcl


 typo

064f8ef

@tdavidcl

Copy link

Author

tdavidcl commented Feb 12, 2026

Also I just found out that there is this file in LLVM libc/src/__support/GPU/utils.h which does define
uint64_t processor_clock() { return __builtin_readcyclecounter(); }
which is used in all test apparently.

We could maybe use that for both Nvidia and AMD since that's what is called within the CI.

@zjin-lcf

Copy link

Contributor

zjin-lcf commented Feb 12, 2026

Thank you. I found some post ROCm/ROCm#1288 that may be related to your comments.

bader

bader reviewed

Feb 12, 2026

View reviewed changes

sycl/include/sycl/ext/oneapi/experimental/clock.hpp Outdated Show resolved Hide resolved

@tdavidcl @bader


 Update sycl/include/sycl/ext/oneapi/experimental/clock.hpp

ba65f4a

Co-authored-by: Alexey Bader <alexey.bader@intel.com>

@tdavidcl

Copy link

Author

tdavidcl commented Feb 12, 2026

Thank you. I found some post ROCm/ROCm#1288 that may be related to your comments.

It seems that the native builtins are better whenever available. I can try to replace the amd & the else branch by __builtin_readcyclecounter then ?

@zjin-lcf

Copy link

Contributor

zjin-lcf commented Feb 12, 2026

@tdavidcl Please give a try for the amd and the else branch. Thanks.

@tdavidcl


 use __builtin_readcyclecounter as fallback

b43c2b8

@tdavidcl

Copy link

Author

tdavidcl commented Feb 13, 2026

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

@KornevNikita

Copy link

Contributor

KornevNikita commented Feb 16, 2026

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/Clock

@tdavidcl tdavidcl temporarily deployed to WindowsCILock

February 16, 2026 17:12

— with GitHub Actions Inactive

@KornevNikita

Copy link

Contributor

KornevNikita commented Feb 16, 2026

@tdavidcl thanks for working on this! Also, these functions require device to support aspects:

llvm/sycl/include/sycl/ext/oneapi/experimental/clock.hpp

Lines 47 to 49 in 1a77753

#ifdef __SYCL_DEVICE_ONLY__

[[__sycl_detail__::__uses_aspects__(sycl::aspect::ext_oneapi_clock_device)]]

#endif

That means we also need something like this but for CUDA adapter.

@tdavidcl tdavidcl temporarily deployed to WindowsCILock

February 16, 2026 17:39

— with GitHub Actions Inactive

@tdavidcl tdavidcl temporarily deployed to WindowsCILock

February 16, 2026 17:39

— with GitHub Actions Inactive

@tdavidcl tdavidcl temporarily deployed to WindowsCILock

February 16, 2026 17:39

— with GitHub Actions Inactive

KornevNikita

KornevNikita reviewed

Feb 16, 2026

View reviewed changes

sycl/include/sycl/ext/oneapi/experimental/clock.hpp

// this is due to potential higher overhead compared to a native API call

// see : https://github.com/ROCm/ROCm/issues/1288

#if defined(__NVPTX__)

if constexpr (Scope == work_group || Scope == sub_group) {

Copy link

Contributor

@KornevNikita KornevNikita Feb 16, 2026 •

edited

Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

if constexpr (Scope == work_group || Scope == sub_group) {

if constexpr (Scope == clock_scope::work_group || Scope == clock_scope::sub_group) {

Note - do not apply this as is, clang-format will fail because strings should be <= 80 symbols.

Copy link

Contributor

@KornevNikita KornevNikita Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably like:

if constexpr (Scope == clock_scope::work_group ||
 Scope == clock_scope::sub_group) {

@tdavidcl

Copy link

Author

tdavidcl commented Feb 16, 2026

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/Clock

Oh perfect it looks like no changes are required in the tests beside enabling the device aspect. Additionally, in the clock test there is this snippet

// UNSUPPORTED: target-native_cpu
// UNSUPPORTED-TRACKER: https://github.com/intel/llvm/issues/20142

I have to check but i think that __builtin_readcyclecounter does support the host and maybe clock() could also be enabled for target-native_cpu.

Conversation

@tdavidcl tdavidcl commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tdavidcl commented Feb 12, 2026

Uh oh!

zjin-lcf commented Feb 12, 2026

Uh oh!

Uh oh!

tdavidcl commented Feb 12, 2026

Uh oh!

zjin-lcf commented Feb 12, 2026

Uh oh!

tdavidcl commented Feb 13, 2026

Uh oh!

KornevNikita commented Feb 16, 2026

Uh oh!

KornevNikita commented Feb 16, 2026

Uh oh!

@KornevNikita KornevNikita Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

@KornevNikita KornevNikita Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

tdavidcl commented Feb 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

@tdavidcl tdavidcl commented Feb 12, 2026 •

edited

Loading

@KornevNikita KornevNikita Feb 16, 2026 •

edited

Loading