On the loss landscape of deep neural networks
Abstract
As a non-convex optimization problem, the training of deep neural networks remains poorly understood, and its success critically depends on the exact network architecture used. While the number of new network architectures proposed in the last decade is staggering, only a handful of common patterns have emerged that are shared by most successful architectures. First, using the smoothness of the optimization landscape as a heuristic for trainability, we investigate which network components render training difficult and how these common patterns help alleviate such difficulties. We find that deep stacks of nonlinear layers, while giving networks their expressivity, significantly increase the roughness of the optimization landscape as network depth increases. Building on prior work, we quantify this effect and show that, for networks at initialization, its strength depends on the smoothness of the nonlinear layer used. We then demonstrate how residual connections and multi-path architectures reduce high frequencies in the optimization landscape, resulting in increased trainability. Second, we find that normalization layers combined with an adequate warm-up scheme compensate for the increasing roughness in lower layers by dynamically re-scaling the layer-wise gradients. We prove that in a properly normalized network, all layer-wise effective learning speeds align over time, compensating even for exponentially exploding gradients at initialization. Finally, we conduct an empirical study to determine the nonlinear depth a network needs in order to generalize effectively on common deep learning tasks. Surprisingly, we find that a shallow network extracted from a deep network after training significantly outperforms a comparably shallow network trained from scratch, although their expressivity is exactly the same. We also observe that ensembles of both shallow and deep paths outperform comparable networks composed only of deep paths, even when extracted after training. Using these insights, we aim to gain a deeper understanding of how to design deep neural networks with high trainability and strong generalization properties.
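The abstract's first theme, using the smoothness of the optimization landscape as a trainability heuristic, can be illustrated with a small probe along a random direction in parameter space. The sketch below is a hedged, assumption-laden example rather than the thesis' actual measurement protocol: it uses PyTorch, randomly generated data, a 1-D slice of the loss, and a crude second-finite-difference roughness score, and all names (PlainMLP, ResidualMLP, roughness_along_direction) are illustrative.

# Minimal sketch (not the thesis' exact protocol): probe loss-landscape roughness
# along a random direction in parameter space, comparing a plain deep tanh MLP
# against one with residual connections. All names and hyperparameters here are
# illustrative assumptions, not taken from the thesis.
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters


class PlainMLP(nn.Module):
    def __init__(self, width=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = torch.tanh(layer(x))
        return x


class ResidualMLP(nn.Module):
    def __init__(self, width=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.tanh(layer(x))  # skip connection around each nonlinear layer
        return x


def roughness_along_direction(model, x, y, radius=1.0, steps=201):
    """Evaluate the loss on a 1-D slice theta + t*d for a random unit direction d
    and return the mean absolute second finite difference as a crude roughness score."""
    loss_fn = nn.MSELoss()
    theta = parameters_to_vector(model.parameters()).detach().clone()
    d = torch.randn_like(theta)
    d = d / d.norm()
    ts = torch.linspace(-radius, radius, steps)
    losses = []
    with torch.no_grad():
        for t in ts:
            vector_to_parameters(theta + t * d, model.parameters())
            losses.append(loss_fn(model(x), y).item())
    vector_to_parameters(theta, model.parameters())  # restore original weights
    losses = torch.tensor(losses)
    second_diff = losses[2:] - 2 * losses[1:-1] + losses[:-2]
    return second_diff.abs().mean().item()


torch.manual_seed(0)
x = torch.randn(256, 64)
y = torch.randn(256, 64)
for name, model in [("plain", PlainMLP()), ("residual", ResidualMLP())]:
    print(name, roughness_along_direction(model, x, y))

In this kind of probe, the residual variant at matched depth would be expected to yield a smoother 1-D loss slice than the plain stack, which is consistent with the abstract's claim that residual connections reduce high-frequency components of the optimization landscape.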