[WIP] distributed training #1334
Conversation
Looks good and building MXNet from source will work for distributed training
Codecov Report
Merging #1334 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1334   +/-  ##
=======================================
  Coverage   67.34%   67.34%
=======================================
  Files           2        2
  Lines          98       98
=======================================
  Hits           66       66
  Misses         32       32
=======================================
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c076093...ec793d5. Read the comment docs.
Just found that the API behavior in the BytePS master branch changed recently... Not sure if that is intended or a bug. Tracked here (bytedance/byteps#292).
@ZiyueHuang You may merge the upstream/master since we recently fixed the CI.
undefined symbols
What are they?
Horovod also has a similar issue, in that the actual broadcast happens after the first iteration. See here
To avoid Horovod's problem, the convention in GluonNLP v1 is to not rely on deferred initialization when implementing models. For example, for a dense layer, we should always specify the in_units.
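A minimal, hypothetical sketch (not the real Gluon API) of why this convention matters: a layer whose input size is unknown cannot allocate, and therefore cannot broadcast, its weight until the first forward pass reveals the input shape, while a layer constructed with in_units knows its weight shape immediately.

```python
# Hypothetical minimal Dense layer illustrating deferred vs. eager
# parameter initialization (names are illustrative, not Gluon internals).
class Dense:
    def __init__(self, units, in_units=None):
        self.units = units
        self.in_units = in_units
        # Weight shape is only known up front when in_units is given.
        self.weight_shape = (units, in_units) if in_units is not None else None

    def forward(self, x_features):
        if self.weight_shape is None:
            # Deferred initialization: shape resolved only at first forward.
            self.weight_shape = (self.units, x_features)
        return self.weight_shape

lazy = Dense(16)                # weight shape unknown -> cannot broadcast yet
eager = Dense(16, in_units=32)  # weight shape known at construction

print(lazy.weight_shape)   # None
print(eager.weight_shape)  # (16, 32)
```

With in_units always given, all parameter shapes are known right after construction, so a distributed trainer can synchronize them before the first forward/backward.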
Sorry for the late reply.
@szha The core dump due to undefined symbols is fixed after the 0820 wheel, and I didn't record the undefined symbols. For the segfault, the stack trace is below:
Fatal Error: Segmentation fault
Stack trace:
Segmentation fault (core dumped)
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/bin/bpslaunch", line 4, in <module>
__import__('pkg_resources').run_script('byteps==0.2.4', 'bpslaunch')
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1464, in run_script
exec(code, namespace, namespace)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 213, in <module>
launch_bps()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 206, in launch_bps
t[i].join()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 34, in join
raise self.exc
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 177, in worker
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/home/ubuntu/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python ./test.py' returned non-zero exit status 139.
@sxjscience @szhengac Horovod doesn't have a similar issue, since in hvd.broadcast_parameters, when there are defer_init parameters, it registers a hook on param.init_impl (see https://github.com/horovod/horovod/blob/9464e20ab2565f25105d5d7f6194ca90d6fca9e5/horovod/mxnet/__init__.py#L121-L127). Thus, immediately after the actual initialization, Horovod broadcasts the parameters, and only then runs forward/backward. However, BytePS has no similar mechanism to handle the defer_init logic.
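A hedged sketch of the hook pattern described above (all names are illustrative stand-ins, not the actual Horovod or Gluon internals): the parameter's init function is wrapped so that a broadcast runs immediately after the deferred initialization actually fires, before the first forward/backward can use the value.

```python
import functools

def broadcast(param):
    # Stand-in for a real broadcast, e.g. hvd.broadcast_(param, root_rank=0).
    param.broadcasted = True

class Param:
    def __init__(self, name):
        self.name = name
        self.broadcasted = False

    def init_impl(self):
        # Stand-in for the framework's actual (possibly deferred) init.
        self.data = 0.0

def hook_deferred_broadcast(param):
    original = param.init_impl

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        original(*args, **kwargs)
        broadcast(param)  # broadcast right after the real initialization

    # Shadow the method on the instance so the hook fires on deferred init.
    param.init_impl = wrapped

p = Param("dense0_weight")
hook_deferred_broadcast(p)
p.init_impl()         # deferred init fires later in the first forward
print(p.broadcasted)  # True: broadcast ran immediately after init
```

Without such a hook, a framework that synchronizes parameters only at construction time silently skips the parameters whose initialization was deferred, which matches the BytePS behavior described here.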
Actually I am a little confused about the semantic meaning of defer_init in the numpy mode, because in the numpy mode a shape containing zeros (such as (0, 5)) is treated as a valid zero-size tensor's shape (and () as a scalar tensor's shape), instead of as unknown as in the legacy mode.
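The distinction can be seen with plain NumPy, whose semantics the numpy mode follows: a shape containing 0 denotes a perfectly valid zero-size array, not a placeholder for an unknown dimension.

```python
import numpy as np

# Under NumPy semantics, (0, 5) is a legitimate shape: an array with
# zero rows and five columns, containing no elements at all.
a = np.zeros((0, 5))
print(a.shape)  # (0, 5)
print(a.size)   # 0
```

So in the numpy mode there is no shape value left over to signal "not yet known," which is what makes the meaning of defer_init ambiguous there.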
You can report it as a bug in MXNet issue tracker
Description
Based on this branch (https://github.com/ZiyueHuang/byteps/tree/mx2), we can perform distributed training for the ELECTRA model (and other models). Tested both on a single worker and on two workers, each with multiple GPU cards. However, there are two issues:

1. We should first call `trainer._init_params()` before the first iteration (forward/backward) to synchronize the parameters across all workers; otherwise `trainer` will call `_init_params` inside `allreduce_grads` (see mxnet/python/mxnet/gluon/trainer.py), so the synchronization of the parameters actually happens after the first forward/backward computation, meaning that in the first iteration the gradients on different workers are computed w.r.t. different parameters. In practice this may not be a severe problem, as only the first gradient descent step is not totally correct. Once the models in gluon-nlp conform to the new coding standard (removing defer_init for all parameters @sxjscience), we can directly call `trainer._init_params` after `model.initialize`. Actually I am a little confused about the semantic meaning of defer_init in the numpy mode, because in the numpy mode a shape containing zeros (such as `(0, 5)`) is treated as a valid zero-size tensor's shape, instead of as unknown as in the legacy mode.
2. I found that we have to build MXNet and BytePS from source on our target machine. Using `pip install mxnet` (I have tried several wheels in August) will not work with BytePS: either it core dumps immediately after `import bps.mxnet` due to undefined symbols (it seems some symbols were not exported into the binary; fixed after the 0820 wheel), or it segfaults mysteriously (maybe due to C++ ABI issues or other compiler issues, since the mxnet wheels and BytePS (which is built on our target machine) may not be compiled with the same version of gcc; besides, there are no related compiler flags such as `-D_GLIBCXX_USE_CXX11_ABI` for MXNet in BytePS's setup.py).

Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team