-
Notifications
You must be signed in to change notification settings - Fork 39
[WIP] Feature/patch/softmax cross entropy with logits #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Feature/patch/softmax cross entropy with logits #21
Conversation
Tied to open tensorflow issue #38185.
Non-determinism in backprop of fused implementation
of `{sparse_,}softmax_cross_entropy_with_logits`
has been reported. This work will provide a
patch by routing calls to a deterministic workaround
first described in the tensorflow issue above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll push changes that rough-out enable_determinism and then you can call this from the appropriate part of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the baby steps. Good job!
Comments above. Heads-up, before you're done, please remember to implement test/test_fused_softmax_cross_entropy.py in the spirit of test/test_patch_bias_add.py. This will use the appropriate forward code from the upstream TensorFlow repo along with the deterministic gradient test code that I provided in TensorFlow Issue 38185 (also testing all the modified op bindings).
I'm going to work on adding enable_determinism, from which _patch_fused_softmax_cross_entropy can be called.
tfdeterminism/patch.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at tensorflow/python/ops/nn_ops.py it seems like it might be necessary to remap the bindings for softmax_cross_entropy_with_logits_v2 as well. After a brief review, I'm not sure what the difference between this and the other versions is, though, apart from the deprecated dim parameter.
tfdeterminism/patch.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, the reason that _new_bias_add_1_14 has the version number in the name is that before stock TensorFlow version 1.14, tf.nn.bias_add has a different API. I have a version of this patch for TensorFlow version 1.13 and earlier that I didn't release because it was not useful without TF_CUDNN_DETERMINISTIC, which was released in version 1.14 of stock TensorFlow. The deterministic bias_add was effectively released in the NVIDIA GPU Cloud (NGC) TensorFlow container version 19.06, which also implemented TF_DETERMINISTIC_OPS, and which enabled the functionality.
I don't think we need the _1_14 on the end of these method names. However, before enabling this patch to be applied to any given version of TensorFlow (as decided in enable_determinism) it should be confirmed that the patched version of the method has the same API as the original on in that version of TensorFlow.
tfdeterminism/patch.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on the end of this line is not relevant here (same for the non-sparse version). You should search for calls to sparse_softmax_cross_entropy_with_logits (and the non-sparse version) in tensorflow/python and do the re-binding as appropriate for the op (with similar, perhaps in some cases identical, comments about why you're doing it).
I just added enable_determinism with this commit. You should use enable_determinism to apply the patch in test/test_patch_fused_softmax_cross_entropy.py.
I'm going reorganize the testing quite a lot before release. For the testing part, I would like you to focus on getting the following complete and working:
$ cd test
$ echo "Test TF1 API:"
$ ./container.sh tensorflow/tensorflow:1.14.0-gpu python test_patch_fused_softmax_cross_entropy.py
$ echo "Test TF2 API:"
$ ./container.sh tensorflow/tensorflow:2.2.0-gpu python test_patch_fused_softmax_cross_entropy.py
...ps://github.com/MFreidank/tensorflow-determinism into feature/patch/softmax_cross_entropy_with_logits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This is the condition for setting TF_DETERMINISTIC_OPS=1: NGC containers with version >= 19.06 or stock TensorFlow with version >= 2.1.
The condition below will ensure that the fused softmax/cross-entropy patch is applied to NGC containers with version >= 19.06 or stock TensorFlow with version >= 1.14 (which includes versions 2.1 and 2.2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to deprecate patch, not make it work on newer versions of TensorFlow, and no longer advertise it in the documentation (will be updated before release). If folks want determinism in TensorFlow version 2.1 or 2.2 then they should be using enable_determinism. Not adding this does not break anything because there was no patch available for TensorFlow versions 2.1 and 2.2 before.
phqtuyen
commented
Aug 5, 2020
Hi, my name is Thomas, I am working at the Oracle Digital Assistant team in which we made extensively use of Tensorflow and Tensorflow-determinism, we really appreciate this wonderful library. I just wish to ask is this issue going to be merged anytime soon?
Hi Thomas (@phqtuyen),
Thanks for letting us know. These ops seem to be generally high priority for determinism in TensorFlow.
@MFreidank: please can you give an update on this and/or move it forward. Most of the rest of the infrastructure is now in place for this.
Hi @phqtuyen, @duncanriach
Sorry for the wait, I had been pulled into other things.
I've picked up work on this now and will have an update to the PR before end of this week.
I'll base my changes on the new infrastructure recently put in place.
Awesome. Just FYI, I'll be offline next week (2020年08月17日 through 2020年08月21日) and, since I'll need to be involved in reviews and iterations, that might slow things down a bit.
Hey @MFreidank, are you still planning on completing this PR?
Hi @MFreidank, I'm about to go offline again until 2020年09月16日, but I wanted to check-in with you again. I think that this op's nondeterminism is near the top of the priority/urgency list for TensorFlow, and I want to get it resolved ASAP. Can you commit to pushing this through from your end?
Thank you, Moritz.
Hi again @duncanriach,
I've implemented a first version (on top of the most recent master branch with enable_determinism) that patches only tf.nn.sparse_softmax_cross_entropy_with_logits and tf.nn.softmax_cross_entropy_with_logits and integrated the unit test case you shared into test/test_softmax_cross_entropy_with_logits.py.
To test that this works as intended I'll need access to my GPU which will only be available to me in another 6 hours.
If the test cases pass, I'll update the PR with the above.
What remains to be determined is which other functions need to be patched after that (by scanning through tensorflow/python as outlined by you above).
I will first focus on getting the minimal patch for the functions in tf.nn to work including unit tests and will then broaden it to include the other functions.
Thanks, @MFreidank. Sounds good. Just to reiterate, I'll be offline until 2020年09月16日 and I'll review this after that.
Hey @MFreidank, please will you submit those changes so that I can review them?
@MFreidank, we will soon release version 0.4.0. We intend to include the fused softmax/cross-entropy patch in version 0.5.0 and I want to get started on it ASAP. If I don't hear from you in the 24 hours, we will proceed to implement it without you.
Addressing previously reported tensorflow issue #38185. Also tied to
tfdeterminismissues #19, #14, #9.Fused implementation of
{sparse_,}softmax_cross_entropy_with_logitsexhibits non-determinism in its backprop.This PR will provide a patch that maintains the same function interface but computes the computational steps involved in a non-fused way that is fully deterministic.