I propose to set up the Adaptive Distributed Training SIG.
Improving the resource utilization of a deep learning cluster is a paramount concern for
many AI practitioners. A promising approach is to use elastic deep learning systems. These
systems allow users to dynamically change the number of training resources allocated to
training jobs. Practitioners can therefore pack a large number of training jobs into a
cluster, significantly improving cluster utilization.
Though promising, elastic deep learning systems are difficult to deploy in practice.
State-of-the-art data-parallel elastic deep learning systems couple the number of training
resources with a critical learning hyper-parameter: the SGD batch size. Any scaling
decision made by the cluster scheduler therefore alters the SGD batch size, which
affects training results and can even prevent training from converging.
In this SIG, we will enable the cluster scheduler to dynamically scale a training job
without affecting its SGD batch size. To achieve this, we want to explore a novel method
to decouple the SGD batch size from the number of training resources, so that changes in
training resources do not affect convergence.
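To illustrate the coupling problem, here is a minimal sketch of one well-known decoupling technique the SIG could evaluate as a baseline: adjusting per-worker gradient-accumulation steps so the global SGD batch size stays fixed as the scheduler adds or removes workers. This is not the SIG's proposed method; the function name and parameters below are hypothetical.

```python
def accumulation_steps(global_batch: int, workers: int, per_worker_batch: int) -> int:
    """Return how many micro-batches each worker must accumulate per SGD step
    so that workers * per_worker_batch * steps == global_batch.

    Hypothetical helper for illustration: holding global_batch fixed keeps the
    effective SGD batch size, and thus convergence behavior, independent of
    the number of workers the scheduler assigns."""
    per_step = workers * per_worker_batch
    if global_batch % per_step != 0:
        raise ValueError(
            "global batch size must be divisible by workers * per-worker batch"
        )
    return global_batch // per_step

# Scaling from 4 to 8 workers halves the accumulation steps, while the
# global batch size seen by SGD stays at 512.
assert accumulation_steps(global_batch=512, workers=4, per_worker_batch=16) == 8
assert accumulation_steps(global_batch=512, workers=8, per_worker_batch=16) == 4
```

The divisibility constraint hints at why a more general method is needed: arbitrary scaling decisions (e.g. 3 or 5 workers) cannot always preserve the batch size with accumulation alone.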