I propose to set up the Adaptive Distributed Training SIG.
Improving the resource utilization of a deep learning cluster is a major concern for many AI practitioners. A promising approach is to use elastic deep learning systems. These systems allow users to dynamically change the amount of training resources allocated to each training job. Hence, practitioners can pack a large number of training jobs into a cluster, significantly improving cluster utilization.
Though promising, elastic deep learning systems are difficult to deploy in practice. State-of-the-art data-parallel elastic deep learning systems couple the number of training resources with a critical learning hyper-parameter: the SGD batch size. Any scaling decision made by the cluster scheduler must therefore alter the SGD batch size, which affects training results and can even cause training to fail to converge.
In this SIG, we will enable the cluster scheduler to dynamically scale a training job without affecting its SGD batch size. To achieve this, we want to explore a novel method that decouples the SGD batch size from the number of training resources, so that a change in training resources does not affect convergence.
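To make the coupling concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the method the SIG intends to develop: it assumes the global batch is cut into a fixed number of equally sized micro-batches that are redistributed whenever the worker count changes. All function and parameter names are hypothetical.

```python
from typing import List


def coupled_global_batch(per_worker_batch: int, num_workers: int) -> int:
    """Conventional data parallelism: the global SGD batch size
    scales with the number of workers."""
    return per_worker_batch * num_workers


def assign_micro_batches(num_micro_batches: int, num_workers: int) -> List[int]:
    """Hypothetical decoupling: split a fixed number of micro-batches
    across the current workers as evenly as possible."""
    if not 1 <= num_workers <= num_micro_batches:
        raise ValueError("need between 1 and num_micro_batches workers")
    base, extra = divmod(num_micro_batches, num_workers)
    # The first `extra` workers each take one additional micro-batch.
    return [base + (1 if rank < extra else 0) for rank in range(num_workers)]


def decoupled_global_batch(micro_batch_size: int,
                           num_micro_batches: int,
                           num_workers: int) -> int:
    """The global batch size depends only on the micro-batch layout,
    not on how many workers process it."""
    per_worker = assign_micro_batches(num_micro_batches, num_workers)
    return micro_batch_size * sum(per_worker)
```

With a per-worker batch of 32, the coupled scheme gives a global batch of 128 on 4 workers and 256 on 8. In the sketch, 8 micro-batches of 32 samples always add up to 256, whether they are processed by 1, 3, or 8 workers; each worker would accumulate gradients over its assigned micro-batches before the update.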
Dear TSC Members,
This is a kind reminder for our upcoming MindSpore TSC Monthly meeting on
Oct 15th. As I mentioned in the last email, we released v1.0 at the
end of September, and the community is gaining momentum.
In order to adjust our community structure to accommodate growing
participation, several changes will be proposed at the Oct 15th meeting:
1. Consolidate into one single monthly meeting for EU-Asia friendly
2. Add two new non-voting organizations: an Expert Committee with its
subsidiary groups and a User Committee with its subsidiary groups.
The two proposals will be discussed and then submitted as a resolution PR
to the mindspore/community repo for final landing.
Looking forward to meeting you all after a long break :)
Zhipeng (Howard) Huang
OpenStack, Kubernetes, CNCF, LF Edge, ONNX, Kubeflow, OpenSDS, Open Service
Broker API, OCP, Hyperledger, ETSI, SNIA, DMTF, W3C