A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling
To sustain the resource-intensive computational needs of training deep neural networks (DNNs) in the post-Moore's-Law era, it is widely accepted that one should exploit the parallelism of large-scale computing clusters when deploying DNN training jobs. However, existing resource schedulers for traditional computing clusters are not well suited to DNN training, which results in unsatisfactory job completion times. The limitations of these resource scheduling schemes motivate us to propose a new computing cluster resource scheduling framework that exploits the special layered structure of DNN jobs and significantly improves their job completion times. Our contributions in this paper are three-fold: i) we develop a new analytical model for resource scheduling that captures the layered structure of DNNs, which enables us to rigorously formulate the resource scheduling optimization problem for DNN training in computing clusters; ii) building on this analytical model, we develop an efficient resource scheduling algorithm based on a sum-of-ratios multi-dimensional-knapsack decomposition (SMD) method that offers strong performance guarantees; iii) we conduct extensive numerical experiments to demonstrate the effectiveness of the proposed scheduling algorithm and its superior performance over the state of the art.
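For readers unfamiliar with this problem class, a generic sum-of-ratios objective subject to multi-dimensional-knapsack constraints can be sketched as follows. This is an illustrative form only, not the paper's exact formulation; all symbols ($x$, $p_k$, $q_k$, $a_{ij}$, $b_i$, $K$, $m$, $n$) are placeholders introduced here for exposition:

```latex
\max_{x \in \{0,1\}^n} \;
  \sum_{k=1}^{K} \frac{p_k^{\top} x}{q_k^{\top} x}
\quad \text{subject to} \quad
  \sum_{j=1}^{n} a_{ij} x_j \le b_i, \qquad i = 1, \dots, m,
```

where each ratio might represent, e.g., a per-job utility normalized by its completion time, and each knapsack constraint a capacity limit on one cluster resource dimension (GPUs, memory, bandwidth). Both the sum-of-ratios structure and the multiple knapsack constraints are individually NP-hard in general, which is why a decomposition approach such as SMD is attractive.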
Committee: Jia Liu (major professor), Pavan Aduri, Hongwei Zhang, Wensheng Zhang, Hridesh Rajan