gpu state clean-up? #231

Answered by 9prady9
progman1 asked this question in Q&A

I've come across some strange behaviour while experimenting with some simple machine learning code.
I have two compiled versions of a model run that differ only in whether they include a normal-distribution
initialisation step, i.e. random number generation (pseudo-random, always starting from the same seed).

Running the version without the normal-distribution step any number of times gives the expected results.
If I then run another binary (a different model) once, the first program invariably outputs incorrect results
afterwards.
Running the binary with the normal-distribution step then appears to reset something, and after that the original program (the one without the step) gives correct results again.

I surmise that the GPU card itself is somehow retaining state between program executions.
Is there anything more that can be done through AF, other than 'init()', to
put the GPU into a known state? (It's an AMD/OpenCL card here.)


The only state ArrayFire retains across sessions is the kernel binaries saved to disk (to avoid recompiling them on the target machine at every startup of the application; this caching started with the recent 3.7.2 release), and these don't hard-code any runtime information. In every session (i.e. each time the ArrayFire library is loaded into memory) it starts afresh; the only global state populated on startup is the information about which devices are accessible.

I don't think even vendor drivers retain any application-specific state other than cached kernel binaries, which are kept for speed reasons. I know that NVIDIA does this; I'm not sure about AMD, but they probably cache kernel binaries too. Either way, none of these store any app-specific information.

From what you describe, this seems to be a seed-related issue, but I can't be certain without looking at some reproducible code, preferably a stand-alone example.

Answer selected by 9prady9

Would it be acceptable to you if I uploaded the code to GitHub? It consists of a slightly modified version of a small machine learning library written in Rust. I haven't found a way to boil the problem down yet.

I suspected the Rust rand-0.7.3 / OS interaction, but discounted that as less likely than the GPU harbouring state, given how prevalent rand is in the Rust ecosystem. When you say 'seed related issue', bear in mind that I'm talking about two independent program executables that don't create or retain anything on disk. I see no other way for them to influence each other than through OS state (the /dev/random infrastructure, perhaps) or hardware. And one is most definitely influencing the other.

I can't speak for drivers with 100% certainty, but ArrayFire for sure doesn't keep any global state that persists across sessions.

A short code snippet that reproduces the issue would be great. As you said in the original description, if it is related to random number generation, we have only a couple of random number generation functions. Have you tried using those functions in standalone code, as a bottom-up approach? That is, gradually add your application logic (the code that surrounds the random number generation) to the standalone program to find which addition breaks the output consistency. That should help narrow down the problem too.
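For the comparison side of such a bottom-up test, a minimal sketch using only the Rust standard library might look like the following. It is not ArrayFire code: `stand_in_stage` is a hypothetical placeholder for whatever stage is under test (in a real test it would be replaced by code that copies that stage's output back to the host), and `fingerprint` and `first_divergence` are illustrative names. The fingerprint can be printed by separate executables and compared between runs, while `first_divergence` pinpoints where two host buffers stop matching bit for bit.

```rust
/// FNV-1a hash over the exact bit patterns of the values, so two runs
/// (even in separate processes) can be compared by printing one number.
fn fingerprint(values: &[f32]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for v in values {
        for byte in v.to_bits().to_le_bytes().iter() {
            hash ^= u64::from(*byte);
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }
    hash
}

/// Index of the first element whose bits differ, or the length of the
/// shorter slice if one is a strict prefix of the other; None if identical.
fn first_divergence(reference: &[f32], candidate: &[f32]) -> Option<usize> {
    let mismatch = reference
        .iter()
        .zip(candidate.iter())
        .position(|(a, b)| a.to_bits() != b.to_bits());
    match mismatch {
        Some(i) => Some(i),
        None if reference.len() != candidate.len() => {
            Some(reference.len().min(candidate.len()))
        }
        None => None,
    }
}

/// HYPOTHETICAL stand-in for the stage under test: a seeded xorshift64
/// generator producing values in [0, 1). Replace with the real stage.
fn stand_in_stage(seed: u64, len: usize) -> Vec<f32> {
    let mut state = seed.max(1); // xorshift state must be non-zero
    (0..len)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state >> 40) as f32 / (1u64 << 24) as f32
        })
        .collect()
}

fn main() {
    let reference = stand_in_stage(42, 8);
    let rerun = stand_in_stage(42, 8);
    println!("reference fingerprint: {:016x}", fingerprint(&reference));
    println!("rerun fingerprint:     {:016x}", fingerprint(&rerun));
    match first_divergence(&reference, &rerun) {
        None => println!("outputs are bit-identical"),
        Some(i) => println!("outputs first differ at index {}", i),
    }
}
```

Adding one piece of application logic at a time and re-checking the fingerprint after each addition shows which step first makes the output depend on what ran before. Comparing bit patterns rather than using a tolerance is a deliberate choice here: with a fixed seed, any difference at all is worth investigating.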


@progman1 Have you figured out what the problem is? If it is something related to system setup, please share your workaround; it would be helpful for any future users who face a similar issue. Thank you.


FYI - If by chance this has anything to do with arrayfire/arrayfire#2980, a couple of randu- and randn-related issues are being handled in that upstream PR.

Closing due to inactivity.


This discussion was converted from issue #231 on December 09, 2020 05:15.