Ph.D. Final Oral Exam: Tianxiang Gao
Speaker: Tianxiang Gao
Mastering Infinite Depths: Optimization and Generalization in Deeper Neural Networks
In recent years, overparameterized neural networks have seen widespread adoption in practical applications, demonstrating remarkable empirical success. This success has sparked theoretical investigations aimed at understanding the mechanisms behind it. Despite having far more parameters than training samples, overparameterized neural networks can achieve zero training loss while still performing well on unseen data, a phenomenon known as benign overfitting. While previous theoretical studies have produced interesting findings supporting these practical observations, they have primarily focused on shallow neural networks, particularly two-layer networks. In practice, however, deep neural networks with large depth are favored for their greater expressive power and superior task performance.
To address this gap, we initiate a study of large-depth neural networks by investigating infinite-depth architectures, specifically deep equilibrium models (DEQs) and neural ordinary differential equations (neural ODEs). Our analysis reveals that for infinite-depth or large-depth neural networks, dynamical stability is crucial not only for training but also for generalization. Consequently, achieving faster training convergence and better generalization requires a carefully chosen scaling strategy and appropriate skip connections to stabilize information propagation in large-depth neural networks during both the forward function evaluation and the backward gradient computation.
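As a rough illustration of why forward stability matters, the sketch below implements a DEQ-style layer whose equilibrium is computed by simple fixed-point iteration, with the weight matrix rescaled so that the update map is a contraction. The rescaling factor, dimensions, and the use of plain fixed-point iteration are illustrative assumptions, not the specific construction analyzed in the dissertation.

```python
import numpy as np

def deq_forward(x, A, B, gamma=0.9, tol=1e-6, max_iter=500):
    """Fixed-point forward pass z* = tanh(A_hat z* + B x) of a DEQ-style layer.

    A is rescaled so the update map is a contraction (spectral norm < 1),
    which keeps forward information propagation stable at "infinite depth".
    The factor `gamma` is an illustrative choice, not the dissertation's
    exact scaling strategy.
    """
    # Rescale A to have spectral norm gamma < 1 -> unique, stable equilibrium.
    A_hat = gamma * A / max(np.linalg.norm(A, 2), 1e-12)
    z = np.zeros(A.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(A_hat @ z + B @ x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Toy usage: a random DEQ layer converges to its equilibrium.
rng = np.random.default_rng(0)
d, p = 64, 16
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, p)) / np.sqrt(p)
z_star = deq_forward(rng.standard_normal(p), A, B)
print(z_star.shape)  # (64,)
```

Since tanh is 1-Lipschitz, keeping the spectral norm of the rescaled weight matrix below one guarantees a unique equilibrium and convergence of the iteration, which is the kind of stability property the analysis relies on.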
We propose a simple yet effective scaling strategy and employ different skip connections to ensure stable information propagation in infinite-depth neural networks. As a result, we demonstrate that gradient descent can train infinite-depth neural networks, including DEQs and neural ODEs, to achieve zero training error and arbitrarily small generalization error, provided that the network is sufficiently overparameterized and sufficiently many training samples are available. Furthermore, we conduct numerical experiments that validate and support our theoretical findings. We also believe that the techniques introduced in this dissertation are not only applicable to DEQs and neural ODEs but also extendable to other large-depth architectures, such as residual neural networks, large-depth recurrent neural networks, and graph neural networks.
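On the discrete side, the effect of depth scaling on skip connections can be illustrated by viewing a residual network as an Euler discretization of a neural ODE. The minimal sketch below compares hidden-state norms with and without a 1/L scaling of the residual branch; the 1/L factor and the random weights are illustrative assumptions rather than the precise scaling analyzed in the dissertation.

```python
import numpy as np

def residual_forward(x, weights, scale):
    """Residual / Euler-discretized neural-ODE forward pass:
    h_{l+1} = h_l + scale * tanh(W_l h_l)."""
    h = x.copy()
    for W in weights:
        h = h + scale * np.tanh(W @ h)
    return h

# Compare hidden-state norms with and without depth-dependent scaling.
rng = np.random.default_rng(0)
d, L = 64, 200
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
x = rng.standard_normal(d)

h_unscaled = residual_forward(x, weights, scale=1.0)    # norm typically grows with depth
h_scaled = residual_forward(x, weights, scale=1.0 / L)  # norm stays O(1) as L grows
print(np.linalg.norm(h_unscaled), np.linalg.norm(h_scaled))
```

Dividing the residual branch by the depth keeps the total update bounded independently of the number of layers, which is one simple way a scaling strategy can stabilize information propagation as depth tends to infinity.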