Distributed Data Parallel communication hook #667
-
Hello,
I am working on a simple idea for a submission to this contest. My idea requires registering a communication hook on PyTorch's DistributedDataParallel model. Essentially, I want to compute the gradient, perform some calculation on it separately on each GPU, and then all_reduce the results. I do not think this violates the spirit of the rules, but please let me know whether you agree. Thank you for your time.
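For concreteness, here is a rough sketch of the kind of hook I have in mind, using PyTorch's DDP register_comm_hook API. The local_transform function below is only a placeholder for the per-GPU calculation, not anything specific:

```python
import torch
import torch.distributed as dist


def local_transform(grad: torch.Tensor) -> torch.Tensor:
    # Placeholder for the per-GPU calculation on the local gradient.
    return grad


def custom_allreduce_hook(state, bucket: dist.GradBucket) -> torch.futures.Future[torch.Tensor]:
    # bucket.buffer() is the flattened local gradient for this bucket on this GPU.
    local_grad = local_transform(bucket.buffer())
    # Pre-divide so that summing across workers yields the average.
    local_grad.div_(dist.get_world_size())
    fut = dist.all_reduce(local_grad, op=dist.ReduceOp.SUM, async_op=True).get_future()
    # DDP expects a future that resolves to the reduced tensor for this bucket.
    return fut.then(lambda f: f.value()[0])


# Usage (ddp_model is a torch.nn.parallel.DistributedDataParallel instance):
# ddp_model.register_comm_hook(state=None, hook=custom_allreduce_hook)
```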
-David Tweedle
-
Hi David,
It sounds like you're asking whether you can manipulate the gradients per data shard. Is that correct?
If that is the question, I think that is within the spirit of the rules.
Also, have you checked out our submission API in https://github.com/mlcommons/algorithmic-efficiency/blob/main/submissions/template/submission.py and the example implementations (https://github.com/mlcommons/algorithmic-efficiency/blob/main/reference_algorithms/paper_baselines/adamw/pytorch/submission.py#L93)? The idea is that submitters are free to implement each of the submission API functions as they wish. Our workload loss functions return 'unreduced' loss values, so I believe you should be able to compute the gradients per shard and perform calculations on them.
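For example, something roughly along these lines should be possible inside your update step. This is only a sketch assuming a standard PyTorch DDP setup; the function and argument names here are illustrative, not our exact submission API:

```python
import torch
import torch.distributed as dist


def update_step(ddp_model, optimizer, loss_fn, batch, targets):
    """Illustrative update step: per-shard gradient calculations, then a manual all_reduce."""
    optimizer.zero_grad()
    # no_sync() disables DDP's automatic gradient synchronization for this
    # backward pass, so each GPU keeps only its local (per-shard) gradients.
    with ddp_model.no_sync():
        per_example_loss = loss_fn(ddp_model(batch), targets)  # unreduced losses
        per_example_loss.mean().backward()

    world_size = dist.get_world_size()
    for param in ddp_model.parameters():
        if param.grad is None:
            continue
        # Whatever per-shard calculation you want goes here; this sketch just
        # rescales so that the SUM all_reduce below produces an average.
        param.grad.div_(world_size)
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)

    optimizer.step()
```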
@runame can you confirm?
-
Agreed, this should definitely be within the spirit of the rules.
-
Yes, I want to perform calculations on each shard. After looking more closely at the examples, I can see how to do this. Thank you for your help.