Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[sycl_ext_oneapi_clock] implement NVPTX case#21280

Open
tdavidcl wants to merge 4 commits intointel:sycl from
tdavidcl:ptx-clock
Open

[sycl_ext_oneapi_clock] implement NVPTX case #21280
tdavidcl wants to merge 4 commits intointel:sycl from
tdavidcl:ptx-clock

Conversation

@tdavidcl
Copy link

@tdavidcl tdavidcl commented Feb 12, 2026
edited
Loading

Hi after suggestion from @zjin-lcf here is a PR (context: KhronosGroup/SYCL-Docs#958).
It implements the NVPTX variant of clock() using the %%clock64 special register from PTX.

https://docs.nvidia.com/cuda/archive/10.1/parallel-thread-execution/index.html?utm_source=chatgpt.com#special-registers-clock64

PTX ISA Notes
Introduced in PTX ISA version 2.0.

So it is safe to assume that the register is supported regardless of the PTX version used since intel llvm assume >5.0 if I recall correctly.

reference for usage internally to llvm (on this repo actually, nice :) )

__DEVICE__ long long clock64() { return __nvvm_read_ptx_sreg_clock64(); }

(there is a typo in the PR which is already corrected by a commit, but i don't why it is not updating in the PR ...)

@tdavidcl tdavidcl requested a review from a team as a code owner February 12, 2026 19:16
Copy link
Author

Also I just found out that there is this file in LLVM libc/src/__support/GPU/utils.h which does define
uint64_t processor_clock() { return __builtin_readcyclecounter(); }
which is used in all test apparently.

We could maybe use that for both Nvidia and AMD since that's what is called within the CI.

Copy link
Contributor

Thank you. I found some post ROCm/ROCm#1288 that may be related to your comments.

Co-authored-by: Alexey Bader <alexey.bader@intel.com>
Copy link
Author

Thank you. I found some post ROCm/ROCm#1288 that may be related to your comments.

It seems that the native builtins are better whenever available. I can try to replace the amd & the else branch by __builtin_readcyclecounter then ?

Copy link
Contributor

@tdavidcl Please give a try for the amd and the else branch. Thanks.

tdavidcl reacted with thumbs up emoji

Copy link
Author

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

Copy link
Contributor

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/Clock

Copy link
Contributor

@tdavidcl thanks for working on this! Also, these functions require device to support aspects:

#ifdef __SYCL_DEVICE_ONLY__
[[__sycl_detail__::__uses_aspects__(sycl::aspect::ext_oneapi_clock_device)]]
#endif

That means we also need something like this but for CUDA adapter.
tdavidcl reacted with thumbs up emoji

// this is due to potential higher overhead compared to a native API call
// see : https://github.com/ROCm/ROCm/issues/1288
#if defined(__NVPTX__)
if constexpr (Scope == work_group || Scope == sub_group) {
Copy link
Contributor

@KornevNikita KornevNikita Feb 16, 2026
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if constexpr (Scope == work_group || Scope == sub_group) {
if constexpr (Scope == clock_scope::work_group || Scope == clock_scope::sub_group) {

Note - do not apply this as is, clang-format will fail because strings should be <= 80 symbols.

tdavidcl reacted with thumbs up emoji
Copy link
Contributor

@KornevNikita KornevNikita Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably like:

if constexpr (Scope == clock_scope::work_group ||
 Scope == clock_scope::sub_group) {

Copy link
Author

I've added it now it needs a bit of testing. I do not have access to a AMD GPU right now though. The best way of action would be probably a simple test to check that it compiles in all configurations + check that the return is both non zero and monotonically increase in subsequent calls. Where is the best spot to add such a test ?

https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/Clock

Oh perfect it looks like no changes are required in the tests beside enabling the device aspect. Additionally, in the clock test there is this snippet

// UNSUPPORTED: target-native_cpu
// UNSUPPORTED-TRACKER: https://github.com/intel/llvm/issues/20142

I have to check but i think that __builtin_readcyclecounter does support the host and maybe clock() could also be enabled for target-native_cpu.

KornevNikita reacted with thumbs up emoji

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@bader bader bader left review comments

@KornevNikita KornevNikita KornevNikita left review comments

@cperkinsintel cperkinsintel Awaiting requested review from cperkinsintel cperkinsintel is a code owner automatically assigned from intel/llvm-reviewers-runtime

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

Comments

AltStyle によって変換されたページ (->オリジナル) /