Search code, repositories, users, issues, pull requests...

Copy link

@MFreidank MFreidank commented Jun 22, 2020

Addressing previously reported tensorflow issue #38185. Also tied to tfdeterminism issues #19, #14, #9.
Fused implementation of {sparse_,}softmax_cross_entropy_with_logits exhibits non-determinism in its backprop.
This PR will provide a patch that maintains the same function interface but computes the computational steps involved in a non-fused way that is fully deterministic.

MFreidank added 2 commits

June 22, 2020 09:56


 patch {sparse_,}softmax_cross_entropy_with_logits

53b4ce3

Tied to open tensorflow issue #38185.
Non-determinism in backprop of fused implementation
of `{sparse_,}softmax_cross_entropy_with_logits`
has been reported. This work will provide a
patch by routing calls to a deterministic workaround
first described in the tensorflow issue above.


 fix: line no for link to previous implementation

aeb70ac

@MFreidank MFreidank mentioned this pull request

Add GPU-deterministic back-prop for fused softmax/cross-entropy ops tensorflow/tensorflow#38185

Closed

MFreidank added 2 commits

June 22, 2020 10:09


 make this patch apply to tf 2.1 and 2.2 as well

bcc8475


 fix: ammend mistake in re.match(...) statement

f7d6b86

duncanriach reviewed

tfdeterminism/patch.py

if re.match("(1\.(14|15)|2\.0)", tf_version):

os.environ['TF_CUDNN_DETERMINISTIC'] = '1'

_patch_bias_add()

_patch_fused_softmax_cross_entropy()

Copy link

Collaborator

@duncanriach duncanriach Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push changes that rough-out enable_determinism and then you can call this from the appropriate part of that.

duncanriach reviewed

tfdeterminism/patch.py

same as `logits` and its shape is the same as `labels` except that it does

not have the last dimension of `labels`.

"""

raise NotImplementedError()

Copy link

Collaborator

@duncanriach duncanriach Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the baby steps. Good job!

Copy link

Collaborator

duncanriach commented Jun 22, 2020 •

edited

Loading

Comments above. Heads-up, before you're done, please remember to implement test/test_fused_softmax_cross_entropy.py in the spirit of test/test_patch_bias_add.py. This will use the appropriate forward code from the upstream TensorFlow repo along with the deterministic gradient test code that I provided in TensorFlow Issue 38185 (also testing all the modified op bindings).

I'm going to work on adding enable_determinism, from which _patch_fused_softmax_cross_entropy can be called.

duncanriach reviewed

tfdeterminism/patch.py Outdated

Comment on lines 149 to 151

tf.nn.softmax_cross_entropy_with_logits = _new_softmax_cross_entropy_with_logits_1_14 # access via public API

nn.softmax_cross_entropy_with_logits = _new_softmax_cross_entropy_with_logits_1_14 # called from tf.keras.layers.convolutional.Conv

nn_ops.softmax_cross_entropy_with_logits = _new_softmax_cross_entropy_with_logits_1_14 # called from tests

Copy link

Collaborator

@duncanriach duncanriach Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at tensorflow/python/ops/nn_ops.py it seems like it might be necessary to remap the bindings for softmax_cross_entropy_with_logits_v2 as well. After a brief review, I'm not sure what the difference between this and the other versions is, though, apart from the deprecated dim parameter.

duncanriach reviewed

tfdeterminism/patch.py Outdated

# The original, pre-patched method can be viewed at

# https://github.com/tensorflow/tensorflow/blob/v1.14.0/tensorflow/python/ops/nn_ops.py#L3182

def _new_softmax_cross_entropy_with_logits_1_14(labels, logits, axis=-1, name=None):

Copy link

Collaborator

@duncanriach duncanriach Jun 22, 2020 •

edited

Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the reason that _new_bias_add_1_14 has the version number in the name is that before stock TensorFlow version 1.14, tf.nn.bias_add has a different API. I have a version of this patch for TensorFlow version 1.13 and earlier that I didn't release because it was not useful without TF_CUDNN_DETERMINISTIC, which was released in version 1.14 of stock TensorFlow. The deterministic bias_add was effectively released in the NVIDIA GPU Cloud (NGC) TensorFlow container version 19.06, which also implemented TF_DETERMINISTIC_OPS, and which enabled the functionality.

I don't think we need the _1_14 on the end of these method names. However, before enabling this patch to be applied to any given version of TensorFlow (as decided in enable_determinism) it should be confirmed that the patched version of the method has the same API as the original on in that version of TensorFlow.

duncanriach reviewed

tfdeterminism/patch.py Outdated

# Non-sparse

tf.nn.sparse_softmax_cross_entropy_with_logits = _new_sparse_softmax_cross_entropy_with_logits_1_14 # access via public API

nn.sparse_softmax_cross_entropy_with_logits = _new_sparse_softmax_cross_entropy_with_logits_1_14 # called from tf.keras.layers.convolutional.Conv

Copy link

Collaborator

@duncanriach duncanriach Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on the end of this line is not relevant here (same for the non-sparse version). You should search for calls to sparse_softmax_cross_entropy_with_logits (and the non-sparse version) in tensorflow/python and do the re-binding as appropriate for the op (with similar, perhaps in some cases identical, comments about why you're doing it).

Copy link

Collaborator

duncanriach commented Jun 23, 2020 •

edited

Loading

I just added enable_determinism with this commit. You should use enable_determinism to apply the patch in test/test_patch_fused_softmax_cross_entropy.py.

I'm going reorganize the testing quite a lot before release. For the testing part, I would like you to focus on getting the following complete and working:

$ cd test
$ echo "Test TF1 API:"
$ ./container.sh tensorflow/tensorflow:1.14.0-gpu python test_patch_fused_softmax_cross_entropy.py
$ echo "Test TF2 API:"
$ ./container.sh tensorflow/tensorflow:2.2.0-gpu python test_patch_fused_softmax_cross_entropy.py

MFreidank added 4 commits

June 23, 2020 14:58


 twd: softmax entropy patch in enable_determinism

75ce3db


 make this patch apply to tf 2.1 and 2.2 as well

231c7c5


 fix: ammend mistake in re.match(...) statement

f22bcc9


 Merge branch 'feature/patch/softmax_cross_entropy_with_logits' of htt...

24b4f57

...ps://github.com/MFreidank/tensorflow-determinism into feature/patch/softmax_cross_entropy_with_logits

duncanriach reviewed

Jun 24, 2020

tfdeterminism/enable_determinism.py

_patch_bias_add()

if in_ngc_cont and ngc_vers.at_least('19.06') or tf_vers.at_least('2.1'):

os.environ['TF_DETERMINISTIC_OPS'] = '1'

# TODO: Add patch crossentropy here as well? Issue seems to still be present on tf 2.1, 2.2

Copy link

Collaborator

@duncanriach duncanriach Jun 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This is the condition for setting TF_DETERMINISTIC_OPS=1: NGC containers with version >= 19.06 or stock TensorFlow with version >= 2.1.

The condition below will ensure that the fused softmax/cross-entropy patch is applied to NGC containers with version >= 19.06 or stock TensorFlow with version >= 1.14 (which includes versions 2.1 and 2.2).

duncanriach reviewed

Jun 24, 2020

tfdeterminism/patch.py

print("TensorFlow version %s has been patched "

"using tfdeterminism version %s" %

(tf_version, __version__), file=sys.stderr)

elif re.match("2\.1|2\.2", tf_version):

Copy link

Collaborator

@duncanriach duncanriach Jun 24, 2020 •

edited

Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to deprecate patch, not make it work on newer versions of TensorFlow, and no longer advertise it in the documentation (will be updated before release). If folks want determinism in TensorFlow version 2.1 or 2.2 then they should be using enable_determinism. Not adding this does not break anything because there was no patch available for TensorFlow versions 2.1 and 2.2 before.

@phqtuyen

Copy link

phqtuyen commented Aug 5, 2020

Hi, my name is Thomas, I am working at the Oracle Digital Assistant team in which we made extensively use of Tensorflow and Tensorflow-determinism, we really appreciate this wonderful library. I just wish to ask is this issue going to be merged anytime soon?

Copy link

Collaborator

duncanriach commented Aug 7, 2020

Hi Thomas (@phqtuyen),

Thanks for letting us know. These ops seem to be generally high priority for determinism in TensorFlow.

@MFreidank: please can you give an update on this and/or move it forward. Most of the rest of the infrastructure is now in place for this.

Copy link

Author

MFreidank commented Aug 11, 2020

Hi @phqtuyen, @duncanriach
Sorry for the wait, I had been pulled into other things.
I've picked up work on this now and will have an update to the PR before end of this week.
I'll base my changes on the new infrastructure recently put in place.

Copy link

Collaborator

duncanriach commented Aug 11, 2020

Awesome. Just FYI, I'll be offline next week (2020年08月17日 through 2020年08月21日) and, since I'll need to be involved in reviews and iterations, that might slow things down a bit.

Copy link

Collaborator

duncanriach commented Sep 2, 2020 •

edited

Loading

Hey @MFreidank, are you still planning on completing this PR?

Copy link

Collaborator

duncanriach commented Sep 4, 2020 •

edited

Loading

Hi @MFreidank, I'm about to go offline again until 2020年09月16日, but I wanted to check-in with you again. I think that this op's nondeterminism is near the top of the priority/urgency list for TensorFlow, and I want to get it resolved ASAP. Can you commit to pushing this through from your end?

Copy link

Author

MFreidank commented Sep 4, 2020 via email

Hi Duncan, Sorry for the delay on this. Yes, I'll pick it up starting tomorrow my time. Duncan Riach <notifications@github.com> schrieb am Fr., 4. Sep. 2020, 20:07:

...

Hi @Freidank, I'm about to go offline again until 2020年09月16日, but I wanted to check-in with you again. I think that this op's nondeterminism is near the top of the priority/urgency list for TensorFlow, and I want to get it resolved ASAP. Can you commit to pushing this through from your end? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#21 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABQSWKBQYI2MLT4QV42MP6DSEEUFJANCNFSM4OEKR34Q> .

Copy link

Collaborator

duncanriach commented Sep 4, 2020

Thank you, Moritz.

Copy link

Author

MFreidank commented Sep 5, 2020 •

edited

Loading

Hi again @duncanriach,
I've implemented a first version (on top of the most recent master branch with enable_determinism) that patches only tf.nn.sparse_softmax_cross_entropy_with_logits and tf.nn.softmax_cross_entropy_with_logits and integrated the unit test case you shared into test/test_softmax_cross_entropy_with_logits.py.
To test that this works as intended I'll need access to my GPU which will only be available to me in another 6 hours.
If the test cases pass, I'll update the PR with the above.
What remains to be determined is which other functions need to be patched after that (by scanning through tensorflow/python as outlined by you above).
I will first focus on getting the minimal patch for the functions in tf.nn to work including unit tests and will then broaden it to include the other functions.

Copy link

Collaborator

duncanriach commented Sep 5, 2020

Thanks, @MFreidank. Sounds good. Just to reiterate, I'll be offline until 2020年09月16日 and I'll review this after that.

Copy link

Collaborator

duncanriach commented Sep 22, 2020

Hey @MFreidank, please will you submit those changes so that I can review them?