[WIP] distributed training #1334
Conversation
Looks good and building MXNet from source will work for distributed training
Codecov Report
Merging #1334 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1334   +/-  ##
=======================================
  Coverage   67.34%   67.34%
=======================================
  Files           2        2
  Lines          98       98
=======================================
  Hits           66       66
  Misses         32       32
=======================================
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c076093...ec793d5. Read the comment docs.
Just found that the API behavior in the BytePS master branch changed recently... Not sure if that is intended or a bug. Tracked here (bytedance/byteps#292).
@ZiyueHuang You may merge the upstream/master since we recently fixed the CI.
undefined symbols
What are they?
Horovod also has a similar issue, in that the actual broadcast happens after the first iteration. See here
To avoid Horovod's problem, the convention in GluonNLP v1 is to not rely on deferred initialization when implementing models. For example, for a dense layer, we should always specify the in_units.
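A minimal, hypothetical sketch (not the real Gluon API) of why this convention matters: a layer whose input size is unknown cannot allocate, and therefore cannot broadcast, its weight until the first forward pass reveals the input shape, while a layer constructed with in_units knows its weight shape immediately.

```python
# Hypothetical minimal Dense layer illustrating deferred vs. eager
# parameter initialization (names are illustrative, not Gluon internals).
class Dense:
    def __init__(self, units, in_units=None):
        self.units = units
        self.in_units = in_units
        # Weight shape is only known up front when in_units is given.
        self.weight_shape = (units, in_units) if in_units is not None else None

    def forward(self, x_features):
        if self.weight_shape is None:
            # Deferred initialization: shape resolved only at first forward.
            self.weight_shape = (self.units, x_features)
        return self.weight_shape

lazy = Dense(16)                # weight shape unknown -> cannot broadcast yet
eager = Dense(16, in_units=32)  # weight shape known at construction

print(lazy.weight_shape)   # None
print(eager.weight_shape)  # (16, 32)
```

With in_units always given, all parameter shapes are known right after construction, so a distributed trainer can synchronize them before the first forward/backward.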
Sorry for the late reply.
@szha The core dump due to undefined symbols is fixed after the 0820 wheel, and I didn't record the undefined symbols. For the segfault, the stack trace is below:
Fatal Error: Segmentation fault
Stack trace:
Segmentation fault (core dumped)
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/bin/bpslaunch", line 4, in <module>
__import__('pkg_resources').run_script('byteps==0.2.4', 'bpslaunch')
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1464, in run_script
exec(code, namespace, namespace)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 213, in <module>
launch_bps()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 206, in launch_bps
t[i].join()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 34, in join
raise self.exc
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 177, in worker
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/home/ubuntu/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python ./test.py' returned non-zero exit status 139.
@sxjscience @szhengac Horovod doesn't have a similar issue, since in hvd.broadcast_parameters, when there are defer_init parameters, it registers a hook on param.init_impl (see https://github.com/horovod/horovod/blob/9464e20ab2565f25105d5d7f6194ca90d6fca9e5/horovod/mxnet/__init__.py#L121-L127). Thus, immediately after the actual initialization, Horovod broadcasts the parameters, and only then runs forward/backward. However, BytePS has no similar mechanism to handle the defer_init logic.
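A hedged sketch of the hook pattern described above (all names are illustrative stand-ins, not the actual Horovod or Gluon internals): the parameter's init function is wrapped so that a broadcast runs immediately after the deferred initialization actually fires, before the first forward/backward can use the value.

```python
import functools

def broadcast(param):
    # Stand-in for a real broadcast, e.g. hvd.broadcast_(param, root_rank=0).
    param.broadcasted = True

class Param:
    def __init__(self, name):
        self.name = name
        self.broadcasted = False

    def init_impl(self):
        # Stand-in for the framework's actual (possibly deferred) init.
        self.data = 0.0

def hook_deferred_broadcast(param):
    original = param.init_impl

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        original(*args, **kwargs)
        broadcast(param)  # broadcast right after the real initialization

    # Shadow the method on the instance so the hook fires on deferred init.
    param.init_impl = wrapped

p = Param("dense0_weight")
hook_deferred_broadcast(p)
p.init_impl()         # deferred init fires later in the first forward
print(p.broadcasted)  # True: broadcast ran immediately after init
```

Without such a hook, a framework that synchronizes parameters only at construction time silently skips the parameters whose initialization was deferred, which matches the BytePS behavior described here.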
Actually I am a little confused about the semantic meaning of defer_init in the numpy mode, because in the numpy mode a shape containing zeros (such as (0, 5)) is treated as a valid zero-size tensor's shape (and () as a scalar tensor's shape), instead of as unknown as in the legacy mode.
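The distinction can be seen with plain NumPy, whose semantics the numpy mode follows: a shape containing 0 denotes a perfectly valid zero-size array, not a placeholder for an unknown dimension.

```python
import numpy as np

# Under NumPy semantics, (0, 5) is a legitimate shape: an array with
# zero rows and five columns, containing no elements at all.
a = np.zeros((0, 5))
print(a.shape)  # (0, 5)
print(a.size)   # 0
```

So in the numpy mode there is no shape value left over to signal "not yet known," which is what makes the meaning of defer_init ambiguous there.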
You can report it as a bug in MXNet issue tracker
Description
Based on this branch (https://github.com/ZiyueHuang/byteps/tree/mx2), we can perform distributed training for the ELECTRA model (and other models). Tested both on a single worker and on two workers, each with multiple GPU cards. However, there are two issues:

1. We should first call `trainer._init_params()` before the first iteration (forward/backward) to synchronize the parameters across all workers; otherwise `trainer` will call `_init_params` inside `allreduce_grads` (see mxnet/python/mxnet/gluon/trainer.py), so the synchronization of the parameters actually happens after the first forward/backward computation, meaning that in the first iteration the gradients on different workers are computed w.r.t. different parameters. In practice this may not be a severe problem, as only the first gradient descent step is not totally correct. Once the models in gluon-nlp conform to the new coding standard (removing defer_init for all parameters @sxjscience), we can directly call `trainer._init_params` after `model.initialize`. Actually I am a little confused about the semantic meaning of defer_init in the numpy mode, because in the numpy mode a shape containing zeros (such as `(0, 5)`) is treated as a valid zero-size tensor's shape, instead of as unknown as in the legacy mode.
2. I found that we have to build MXNet and BytePS from source on our target machine. Using `pip install mxnet` (I have tried several wheels in August) will not work with BytePS: either it core dumps immediately after `import bps.mxnet` due to undefined symbols (it seems some symbols were not exported into the binary; fixed after the 0820 wheel), or it segfaults mysteriously (maybe due to C++ ABI issues or other compiler issues, since the mxnet wheels and BytePS (which is built on our target machine) may not be compiled with the same version of gcc; besides, there are no related compiler flags such as `-D_GLIBCXX_USE_CXX11_ABI` for MXNet in BytePS's setup.py).

Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team