Dear All,
I propose to set up the Adaptive Distributed Training SIG.
Improving the resource utilization of a deep learning cluster is of paramount concern for many AI practitioners. A promising approach is to use elastic deep learning systems. These systems allow users to dynamically change the amount of training resources allocated to a training job. Hence, practitioners can pack a large number of training jobs into a cluster, significantly improving cluster utilization.
Though promising, elastic deep learning systems are difficult to deploy in practice. State-of-the-art data-parallel elastic deep learning systems couple the number of training resources with a critical learning hyper-parameter: the SGD batch size. Any scaling decision made by the cluster scheduler therefore alters the SGD batch size, which affects training results and can even cause training to fail to converge.
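To make the coupling concrete, here is a toy illustration (a minimal Python sketch; the names and numbers are purely illustrative, not taken from any specific system):

# In standard data-parallel training, the effective SGD batch size is the
# per-worker micro-batch multiplied by the number of workers.
PER_WORKER_BATCH = 32

def effective_sgd_batch(num_workers: int) -> int:
    """Batch size the optimizer actually sees per update step."""
    return PER_WORKER_BATCH * num_workers

# If the scheduler scales the job from 8 workers down to 6, the
# hyper-parameter silently changes from 256 to 192.
print(effective_sgd_batch(8))  # 256
print(effective_sgd_batch(6))  # 192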
In this SIG, we will enable the cluster scheduler to dynamically scale a training job without affecting its SGD batch size. To achieve this, we want to explore a novel method to decouple the SGD batch size from the number of training resources, so that changes in training resources do not affect convergence.
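To give a flavour of one possible direction (this is only a rough sketch based on gradient accumulation, not a settled design of the SIG; all names and numbers below are hypothetical): fix the global SGD batch size up front, give each worker a share of it, and let a worker accumulate several micro-batches whenever its share exceeds what fits on the device, so the optimizer always sees the same batch size no matter how many workers the scheduler assigns.

import math

GLOBAL_BATCH = 256    # SGD batch size chosen by the user; never changed by scaling
MAX_MICRO_BATCH = 32  # largest micro-batch that fits in one device's memory

def per_worker_plan(num_workers: int, rank: int):
    """Return (samples_for_this_worker, accumulation_steps) such that the
    samples summed over all workers equal GLOBAL_BATCH at any scale."""
    base, rem = divmod(GLOBAL_BATCH, num_workers)
    samples = base + (1 if rank < rem else 0)
    steps = max(1, math.ceil(samples / MAX_MICRO_BATCH))
    return samples, steps

# Whether the scheduler assigns 8, 6, or 3 workers, the optimizer still
# updates with a 256-sample batch; only the per-worker workload changes.
for workers in (8, 6, 3):
    plan = [per_worker_plan(workers, r) for r in range(workers)]
    print(workers, sum(s for s, _ in plan), plan[0])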
https://gitee.com/mindspore/community/blob/master/design/meps/mep-adaptive/M...
https://gitee.com/mindspore/community/tree/master/sigs/adaptivetraining
Best regards
Beilei Sun