203 questions
0 votes · 1 answer · 47 views
Meteor 3 connect to a different server
I'm trying to connect my Meteor app running on localhost:4000 to the server of another Meteor app running on localhost:3000.
Reading the docs, I'm trying to do
import { DDP } from 'meteor/ddp-client'
...
1 vote · 0 answers · 70 views
Issue with running training on multiple GPUs using DDP
I am training a classifier model, but since it is taking far too long I want to use multiple GPUs for the training. The current code is
rank = int(os.environ["RANK"])
world_size = int(os.environ["...
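For comparison, a minimal sketch of the usual torchrun-launched setup: torchrun puts RANK, LOCAL_RANK, and WORLD_SIZE into the environment, and each worker pins its own GPU before wrapping the model. The nn.Linear is an assumed stand-in for the asker's classifier.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 2).cuda(local_rank)  # stand-in for the asker's classifier
    model = DDP(model, device_ids=[local_rank])
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()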
0 votes · 0 answers · 82 views
Using learning rate schedulers with DDP
I would like to use a learning rate scheduler to (potentially) adapt the learning rate after each epoch, depending on a metric gathered from the validation dataset. However, I am not sure how to use ...
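An untested sketch of the common pattern, assuming the process group is already initialised and with train_one_epoch/validate as hypothetical stand-ins for the asker's loop: average the validation metric across ranks first, then call scheduler.step() on every rank with the same value so the learning rates stay in sync.

import torch
import torch.distributed as dist

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)

for epoch in range(10):
    train_one_epoch(model, optimizer)      # hypothetical helper
    local_val_loss = validate(model)       # hypothetical helper, per-rank metric
    metric = torch.tensor([local_val_loss], device="cuda")
    dist.all_reduce(metric, op=dist.ReduceOp.AVG)  # same value on every rank
    scheduler.step(metric.item())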
0 votes · 0 answers · 463 views
Facing an issue with connecting to a socket with DDP and PyTorch (single node, multi-GPU communication)
I am completely new to distributed programming and I have been trying to port the original code that ran on a multi-node cluster to a single-node cluster with multiple GPUs. My goal is to simulate a ...
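As a point of comparison, a minimal single-node sketch: with every GPU on one machine, the rendezvous socket can simply be a free port on localhost, passed through an explicit tcp:// init method; the port number here is an arbitrary assumption.

import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        "nccl",
        init_method="tcp://127.0.0.1:29500",  # any free local port
        rank=rank,
        world_size=world_size,
    )
    # ... per-rank DDP training ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # 2 = number of local GPUs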
0 votes · 0 answers · 419 views
YOLO v11 multi-GPU DDP training errors
I tried training a YOLO model on Kaggle with 2 Tesla T4 GPUs (15 GB) as follows:
model = YOLO('yolo11l.pt')
model.train(
data='/kaggle/working/tooth-detect-2/data.yaml',
epochs=100,
batch=...
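For reference, a hedged sketch of the usual Ultralytics multi-GPU invocation: passing both GPUs in the device argument is what triggers Ultralytics' own internal DDP launcher, so the script itself should not also be wrapped in torchrun. The batch value is illustrative, since the original excerpt truncates there.

from ultralytics import YOLO

model = YOLO('yolo11l.pt')
model.train(
    data='/kaggle/working/tooth-detect-2/data.yaml',
    epochs=100,
    batch=16,        # illustrative value; the excerpt above is truncated here
    device=[0, 1],   # both T4s; a multi-GPU device list triggers DDP internally
)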
0 votes · 0 answers · 322 views
Parameter tuning with Slurm, Optuna, PyTorch Lightning, and KFold
With the following toy script, I am trying to tune the learning rate hyper-parameter of a perceptron with Optuna and 5-fold cross-validation. My cluster has multiple GPUs on ...
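A minimal single-process sketch of the tuning logic alone (no Slurm or Lightning), assuming random stand-in data: each Optuna trial samples a learning rate and returns the validation loss averaged over the 5 folds.

import optuna
import torch
from sklearn.model_selection import KFold

X, y = torch.randn(200, 10), torch.randn(200, 1)  # stand-in dataset

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    fold_losses = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = torch.nn.Linear(10, 1)   # the "perceptron"
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(20):              # deliberately short training loop
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(X[train_idx]), y[train_idx])
            loss.backward()
            opt.step()
        with torch.no_grad():
            fold_losses.append(
                torch.nn.functional.mse_loss(model(X[val_idx]), y[val_idx]).item()
            )
    return sum(fold_losses) / len(fold_losses)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)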
1 vote · 1 answer · 713 views
What is the difference between register_parameter(requires_grad=False) and register_buffer in PyTorch?
During deep learning training, the update of the values declared with this variable is done under with torch.no_grad().
When learning with DDP, it's the process of obtaining the average for each batch ...
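A small runnable demonstration of the difference may help here: both entries end up in the state_dict, but only the frozen parameter appears in .parameters(); a buffer is never a Parameter at all, so DDP broadcasts it from rank 0 instead of expecting a gradient for it.

import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # a Parameter that is saved but never updated by the optimizer
        self.register_parameter(
            "frozen_p", nn.Parameter(torch.zeros(3), requires_grad=False)
        )
        # a plain tensor tracked by the module, not a Parameter
        self.register_buffer("buf", torch.zeros(3))

m = Demo()
print([n for n, _ in m.named_parameters()])  # ['frozen_p']
print(list(m.state_dict().keys()))           # ['frozen_p', 'buf']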
2 votes · 1 answer · 4k views
PyTorch torchrun command can not find rendezvous endpoint, RendezvousConnectionError
I'm practicing PyTorch for multi-node DDP in a Docker container,
and my program runs properly when I run
torchrun \
--nnodes=1 \
--node_rank=0 \
--nproc_per_node=gpu \
...
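One way to debug a RendezvousConnectionError is to confirm the endpoint is reachable before launching; here is a small sketch, assuming the usual MASTER_ADDR/MASTER_PORT variables (the defaults below are placeholders):

import os
import socket

addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
port = int(os.environ.get("MASTER_PORT", "29400"))
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    err = s.connect_ex((addr, port))  # 0 means the TCP connect succeeded
print("endpoint reachable" if err == 0 else f"connect failed, errno {err}")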
2 votes · 1 answer · 778 views
What is the simplest way to train a pytorch-lightning model over a bunch of servers with Dask?
I have access to a couple dozen Dask servers without GPUs but with complete control of the software (I can wipe them and install something different) and want to accelerate pytorch-lightning model ...
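A hedged sketch of one option, treating the machines purely as hosts: Lightning's own ddp strategy can run over plain CPU processes (gloo backend) without involving Dask's scheduler at all. The tiny module and the devices/num_nodes values below are stand-ins.

import torch
import pytorch_lightning as pl

class Tiny(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

ds = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
trainer = pl.Trainer(
    accelerator="cpu",
    devices=4,        # processes per machine
    num_nodes=1,      # raise this once the remote hosts are wired up
    strategy="ddp",
    max_epochs=1,
)
trainer.fit(Tiny(), torch.utils.data.DataLoader(ds, batch_size=8))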
3 votes · 3 answers · 7k views
Distributed Data Parallel (DDP) Batch size
Suppose I use 2 GPUs in a DDP setting.
So, if I intend to use 16 as the batch size when running the experiment on a single GPU,
should I give 8 or 16 as the batch size in the case of using 2 ...
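The usual arithmetic, as a sketch assuming the process group is already initialised: the DataLoader batch size under DDP is per process, so with 2 GPUs a per-GPU batch of 8 reproduces the single-GPU global batch of 16.

import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

global_batch = 16
world_size = torch.distributed.get_world_size()
per_gpu_batch = global_batch // world_size   # 8 when world_size == 2

ds = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
loader = DataLoader(
    ds,
    batch_size=per_gpu_batch,
    sampler=DistributedSampler(ds),  # gives each rank a distinct shard
)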
1 vote · 1 answer · 298 views
How do I wait for a successful connection using DDP in Meteor (server -> server)?
Continuing the discussion from DDP: how do I wait for a connection?:
Based on the thread above, we can leverage Tracker.autorun to wait for and confirm a successful connection between a client and ...
1 vote · 0 answers · 323 views
Finding the number of nodes and GPUs for DistributedDataParallel
I would like to know what numbers I should select for nodes and gpus.
I use Tesla V100-SXM2 (8 boards).
I tried:
nodes=1, gpus=1 (only the first GPU works)
nodes=1, gpus=8 (it took very long ...
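A quick sanity check for those numbers, as a sketch: on a single machine with 8 V100 boards the usual choice is nodes=1 and gpus equal to torch.cuda.device_count(), giving a world size of 8 worker processes.

import torch

nodes = 1
gpus = torch.cuda.device_count()   # expected to be 8 on an 8-board V100 node
world_size = nodes * gpus          # one DDP process per GPU
print(f"nodes={nodes} gpus={gpus} world_size={world_size}")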
0 votes · 1 answer · 98 views
How to separate a Meteor app so that the Meteor server with GraphQL and the ReactJS frontend run separately from each other?
I already have an app in Meteor/ReactJS with GraphQL. I am trying to separate it in a way that the Meteor server and the ReactJS frontend run individually, and to link them to each other. Any example or ...
3 votes · 1 answer · 215 views
How to publish data without MongoDB through subscriptions in Meteor?
Essentially, this is what I had in mind:
Client
const Messages = new Mongo.Collection("messages");
Meteor.call("message", { room: "foo", message: "Hello world"...
12 votes · 5 answers · 5k views
PyTorch Lightning duplicates main script in ddp mode
When I launch my main script on the cluster in ddp mode (2 GPUs), PyTorch Lightning duplicates whatever is executed in the main script, e.g. prints or other logic. I need some extended training ...
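A sketch of the usual guard, assuming a recent PyTorch Lightning: under ddp every rank re-executes the whole script, so one-off logic (prints, logging setup, file writes) is typically wrapped so that only the global-zero rank runs it.

from pytorch_lightning.utilities import rank_zero_only

@rank_zero_only
def report(msg: str) -> None:
    print(msg)  # executes once, on the global-zero process only

report("starting extended training logic")
# inside the training script, trainer.is_global_zero offers the same check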