I don't fully understand how clusters work with ParameterServerStrategy in TensorFlow, and I need some clarification.
I have read this tutorial, but it does not clearly explain how to run ParameterServerStrategy across multiple machines. I have a working version on a single machine, but the workers and the parameter servers (ps) don't seem to do anything; it is the chief that runs everything. I have also tried running it on multiple machines, using their global IPs and unused ports for the workers and ps, but the chief does not seem to find them.
asked Feb 22, 2025 at 13:50
ali-saaeddin-1123581321
Configure the TF_CONFIG environment variable on each machine with the correct cluster addresses and roles to use ParameterServerStrategy across multiple machines. Incorrect TF_CONFIG settings or firewalls are the usual culprits if the chief node can't find other nodes.
Sagar, commented Aug 25, 2025 at 7:55
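To make the TF_CONFIG suggestion concrete, here is a minimal sketch of how the variable could be built on each machine. The addresses, ports, and the helper names `make_tf_config` and `set_tf_config` are hypothetical placeholders, not part of TensorFlow; only the JSON layout (a `cluster` dict plus a `task` entry with `type` and `index`) follows the format TF_CONFIG expects. Every machine should use the same `cluster` dict and differ only in its `task` entry.

```python
import json
import os

# Hypothetical addresses: replace with the reachable host:port of each machine.
# The same dict must be used on every machine in the cluster.
CLUSTER = {
    "chief": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
    "ps": ["10.0.0.4:2222"],
}


def make_tf_config(task_type, task_index, cluster=CLUSTER):
    """Return the TF_CONFIG JSON string for one task in the cluster."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} with index {task_index}")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


def set_tf_config(task_type, task_index):
    """Export TF_CONFIG for this process; call before creating the strategy."""
    os.environ["TF_CONFIG"] = make_tf_config(task_type, task_index)


if __name__ == "__main__":
    # Example: this machine is the second worker (index 1).
    set_tf_config("worker", 1)
    print(os.environ["TF_CONFIG"])
```

With TF_CONFIG set this way, the worker and ps machines would typically each start a `tf.distribute.Server` and block on `join()`, while the chief builds the strategy from `tf.distribute.cluster_resolver.TFConfigClusterResolver()`. The listed ports also have to be open between the machines, otherwise the chief will wait indefinitely for the other tasks.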