203 questions
0 votes · 1 answer · 47 views
Meteor 3 connect to a different server
I'm trying to connect my Meteor app running on localhost:4000 to the server of another Meteor app running on localhost:3000.
Reading the docs, I'm trying to do
import { DDP } from 'meteor/ddp-client'
...
1 vote · 0 answers · 70 views
Issue with running training on multiple GPUs using DDP
I am training a classifier model, but since it is taking far too long I want to use multiple GPUs for the training. The current code is
rank = int(os.environ["RANK"])
world_size = int(os.environ["...
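For comparison, a minimal sketch of the usual torchrun-launched setup: torchrun puts RANK, LOCAL_RANK, and WORLD_SIZE into the environment, and each worker pins its own GPU before wrapping the model. The nn.Linear is an assumed stand-in for the asker's classifier.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 2).cuda(local_rank)  # stand-in for the asker's classifier
    model = DDP(model, device_ids=[local_rank])
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()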
0 votes · 0 answers · 82 views
Using learning rate schedulers with DDP
I would like to use a learning rate scheduler to (potentially) adapt the learning rate after each epoch, depending on a metric gathered from the validation dataset. However, I am not sure how to use ...
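An untested sketch of the common pattern, assuming the process group is already initialised and with train_one_epoch/validate as hypothetical stand-ins for the asker's loop: average the validation metric across ranks first, then call scheduler.step() on every rank with the same value so the learning rates stay in sync.

import torch
import torch.distributed as dist

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)

for epoch in range(10):
    train_one_epoch(model, optimizer)      # hypothetical helper
    local_val_loss = validate(model)       # hypothetical helper, per-rank metric
    metric = torch.tensor([local_val_loss], device="cuda")
    dist.all_reduce(metric, op=dist.ReduceOp.AVG)  # same value on every rank
    scheduler.step(metric.item())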
0 votes · 0 answers · 463 views
Facing an issue with connecting to a socket with DDP and PyTorch (single node, multi-GPU communication)
I am completely new to distributed programming and I have been trying to port the original code that ran on a multi-node cluster to a single-node cluster with multiple GPUs. My goal is to simulate a ...
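As a point of comparison, a minimal single-node sketch: with every GPU on one machine, the rendezvous socket can simply be a free port on localhost, passed through an explicit tcp:// init method; the port number here is an arbitrary assumption.

import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        "nccl",
        init_method="tcp://127.0.0.1:29500",  # any free local port
        rank=rank,
        world_size=world_size,
    )
    # ... per-rank DDP training ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # 2 = number of local GPUs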
0 votes · 0 answers · 419 views
YOLO v11 multi-GPU DDP training errors
I tried training a YOLO model on Kaggle with 2 Tesla T4 GPUs (15 GB) as follows:
model = YOLO('yolo11l.pt')
model.train(
data='/kaggle/working/tooth-detect-2/data.yaml',
epochs=100,
batch=...
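For reference, a hedged sketch of the usual Ultralytics multi-GPU invocation: passing both GPUs in the device argument is what triggers Ultralytics' own internal DDP launcher, so the script itself should not also be wrapped in torchrun. The batch value is illustrative, since the original excerpt truncates there.

from ultralytics import YOLO

model = YOLO('yolo11l.pt')
model.train(
    data='/kaggle/working/tooth-detect-2/data.yaml',
    epochs=100,
    batch=16,        # illustrative value; the excerpt above is truncated here
    device=[0, 1],   # both T4s; a multi-GPU device list triggers DDP internally
)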
0 votes · 0 answers · 322 views
Parameter tuning with Slurm, Optuna, PyTorch Lightning, and KFold
With the following toy script, I am trying to tune the learning rate hyper-parameter of a perceptron with Optuna and 5-fold cross-validation. My cluster has multiple GPUs on ...
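A minimal single-process sketch of the tuning logic alone (no Slurm or Lightning), assuming random stand-in data: each Optuna trial samples a learning rate and returns the validation loss averaged over the 5 folds.

import optuna
import torch
from sklearn.model_selection import KFold

X, y = torch.randn(200, 10), torch.randn(200, 1)  # stand-in dataset

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    fold_losses = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = torch.nn.Linear(10, 1)   # the "perceptron"
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(20):              # deliberately short training loop
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(X[train_idx]), y[train_idx])
            loss.backward()
            opt.step()
        with torch.no_grad():
            fold_losses.append(
                torch.nn.functional.mse_loss(model(X[val_idx]), y[val_idx]).item()
            )
    return sum(fold_losses) / len(fold_losses)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)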
1 vote · 1 answer · 713 views
What is the difference between register_parameter(requires_grad=False) and register_buffer in PyTorch?
During deep learning training, the update of the values declared with this variable is done under with torch.no_grad().
When learning with DDP, it's the process of obtaining the average for each batch ...
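A small runnable demonstration of the difference may help here: both entries end up in the state_dict, but only the frozen parameter appears in .parameters(); a buffer is never a Parameter at all, so DDP broadcasts it from rank 0 instead of expecting a gradient for it.

import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # a Parameter that is saved but never updated by the optimizer
        self.register_parameter(
            "frozen_p", nn.Parameter(torch.zeros(3), requires_grad=False)
        )
        # a plain tensor tracked by the module, not a Parameter
        self.register_buffer("buf", torch.zeros(3))

m = Demo()
print([n for n, _ in m.named_parameters()])  # ['frozen_p']
print(list(m.state_dict().keys()))           # ['frozen_p', 'buf']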
2 votes · 1 answer · 4k views
PyTorch torchrun command can not find rendezvous endpoint, RendezvousConnectionError
I'm practicing PyTorch for multi-node DDP in a Docker container,
and my program runs properly when I run
torchrun \
--nnodes=1 \
--node_rank=0 \
--nproc_per_node=gpu \
...
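One way to debug a RendezvousConnectionError is to confirm the endpoint is reachable before launching; here is a small sketch, assuming the usual MASTER_ADDR/MASTER_PORT variables (the defaults below are placeholders):

import os
import socket

addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
port = int(os.environ.get("MASTER_PORT", "29400"))
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    err = s.connect_ex((addr, port))  # 0 means the TCP connect succeeded
print("endpoint reachable" if err == 0 else f"connect failed, errno {err}")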
2 votes · 1 answer · 778 views
What is the simplest way to train a pytorch-lightning model over a bunch of servers with Dask?
I have access to a couple dozen Dask servers without GPUs but with complete control of the software (I can wipe them and install something different) and want to accelerate pytorch-lightning model ...
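A hedged sketch of one option, treating the machines purely as hosts: Lightning's own ddp strategy can run over plain CPU processes (gloo backend) without involving Dask's scheduler at all. The tiny module and the devices/num_nodes values below are stand-ins.

import torch
import pytorch_lightning as pl

class Tiny(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

ds = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
trainer = pl.Trainer(
    accelerator="cpu",
    devices=4,        # processes per machine
    num_nodes=1,      # raise this once the remote hosts are wired up
    strategy="ddp",
    max_epochs=1,
)
trainer.fit(Tiny(), torch.utils.data.DataLoader(ds, batch_size=8))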
3 votes · 3 answers · 7k views
Distributed Data Parallel (DDP) Batch size
Suppose I use 2 GPUs in a DDP setting.
So, if I intend to use 16 as the batch size when running the experiment on a single GPU,
should I give 8 or 16 as the batch size in the case of using 2 ...
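The usual arithmetic, as a sketch assuming the process group is already initialised: the DataLoader batch size under DDP is per process, so with 2 GPUs a per-GPU batch of 8 reproduces the single-GPU global batch of 16.

import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

global_batch = 16
world_size = torch.distributed.get_world_size()
per_gpu_batch = global_batch // world_size   # 8 when world_size == 2

ds = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
loader = DataLoader(
    ds,
    batch_size=per_gpu_batch,
    sampler=DistributedSampler(ds),  # gives each rank a distinct shard
)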
1 vote · 1 answer · 298 views
How do I wait for a successful connection using DDP in Meteor (server -> server)?
Continuing the discussion from DDP: how do I wait for a connection?:
Based on the thread above, we can leverage Tracker.autorun to wait for and confirm a successful connection between a client and ...
1 vote · 0 answers · 323 views
Finding the number of nodes and GPUs for DistributedDataParallel
I would like to know what numbers I should select for nodes and gpus.
I use Tesla V100-SXM2 (8 boards).
I tried:
nodes=1, gpus=1 (only the first GPU works)
nodes=1, gpus=8 (it took very long ...
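A quick sanity check for those numbers, as a sketch: on a single machine with 8 V100 boards the usual choice is nodes=1 and gpus equal to torch.cuda.device_count(), giving a world size of 8 worker processes.

import torch

nodes = 1
gpus = torch.cuda.device_count()   # expected to be 8 on an 8-board V100 node
world_size = nodes * gpus          # one DDP process per GPU
print(f"nodes={nodes} gpus={gpus} world_size={world_size}")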
0 votes · 1 answer · 98 views
How to separate a Meteor app so that the Meteor server with GraphQL and the ReactJS frontend run separately from each other?
I already have an app in Meteor/ReactJS with GraphQL. I am trying to separate it in a way that the Meteor server and the ReactJS frontend run individually, and to link them to each other. Any example or ...
3 votes · 1 answer · 215 views
How to publish data without MongoDB through subscriptions in Meteor?
Essentially, this is what I had in mind:
Client
const Messages = new Mongo.Collection("messages");
Meteor.call("message", { room: "foo", message: "Hello world"...
12 votes · 5 answers · 5k views
PyTorch Lightning duplicates main script in ddp mode
When I launch my main script on the cluster in ddp mode (2 GPUs), PyTorch Lightning duplicates whatever is executed in the main script, e.g. prints or other logic. I need some extended training ...
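A sketch of the usual guard, assuming a recent PyTorch Lightning: under ddp every rank re-executes the whole script, so one-off logic (prints, logging setup, file writes) is typically wrapped so that only the global-zero rank runs it.

from pytorch_lightning.utilities import rank_zero_only

@rank_zero_only
def report(msg: str) -> None:
    print(msg)  # executes once, on the global-zero process only

report("starting extended training logic")
# inside the training script, trainer.is_global_zero offers the same check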