On the loss landscape of deep neural networks
Abstract
As a non-convex optimization problem, the training of deep neural networks remains poorly understood, and its success critically depends on the exact network architecture used. While the number of new network architectures proposed in the last decade is staggering, only a handful of common patterns have emerged that are shared by most successful architectures. First, using the smoothness of the optimization landscape as a heuristic for trainability, we investigate which network components render training difficult and how these common patterns help alleviate such difficulties. We find that deep stacks of nonlinear layers, while giving networks their expressivity, significantly increase the roughness of the optimization landscape as network depth increases. Building on prior work, we quantify this effect and show that, for networks at initialization, its strength depends on the smoothness of the nonlinear layer used. We then demonstrate how residual connections and multi-path architectures reduce high frequencies in the optimization landscape, resulting in increased trainability. Second, we find that normalization layers combined with an adequate warm-up scheme compensate for the increasing roughness in lower layers by dynamically re-scaling the layer-wise gradients. We prove that in a properly normalized network, all layer-wise effective learning speeds align over time, compensating even for exponentially exploding gradients at initialization. Finally, we conduct an empirical study to determine the nonlinear depth a network needs in order to generalize effectively on common deep learning tasks. Surprisingly, we find that a shallow network extracted from a deep network after training significantly outperforms a comparably shallow network trained from scratch, although their expressivity is exactly the same. We also observe that ensembles of both shallow and deep paths outperform comparable networks composed only of deep paths, even when extracted after training. Using these insights, we aim to gain a deeper understanding of how to design deep neural networks with high trainability and strong generalization properties.
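The abstract's first theme, using the smoothness of the optimization landscape as a trainability heuristic, can be illustrated with a small probe along a random direction in parameter space. The sketch below is a hedged, assumption-laden example rather than the thesis' actual measurement protocol: it uses PyTorch, randomly generated data, a 1-D slice of the loss, and a crude second-finite-difference roughness score, and all names (PlainMLP, ResidualMLP, roughness_along_direction) are illustrative.

# Minimal sketch (not the thesis' exact protocol): probe loss-landscape roughness
# along a random direction in parameter space, comparing a plain deep tanh MLP
# against one with residual connections. All names and hyperparameters here are
# illustrative assumptions, not taken from the thesis.
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters


class PlainMLP(nn.Module):
    def __init__(self, width=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = torch.tanh(layer(x))
        return x


class ResidualMLP(nn.Module):
    def __init__(self, width=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.tanh(layer(x))  # skip connection around each nonlinear layer
        return x


def roughness_along_direction(model, x, y, radius=1.0, steps=201):
    """Evaluate the loss on a 1-D slice theta + t*d for a random unit direction d
    and return the mean absolute second finite difference as a crude roughness score."""
    loss_fn = nn.MSELoss()
    theta = parameters_to_vector(model.parameters()).detach().clone()
    d = torch.randn_like(theta)
    d = d / d.norm()
    ts = torch.linspace(-radius, radius, steps)
    losses = []
    with torch.no_grad():
        for t in ts:
            vector_to_parameters(theta + t * d, model.parameters())
            losses.append(loss_fn(model(x), y).item())
    vector_to_parameters(theta, model.parameters())  # restore original weights
    losses = torch.tensor(losses)
    second_diff = losses[2:] - 2 * losses[1:-1] + losses[:-2]
    return second_diff.abs().mean().item()


torch.manual_seed(0)
x = torch.randn(256, 64)
y = torch.randn(256, 64)
for name, model in [("plain", PlainMLP()), ("residual", ResidualMLP())]:
    print(name, roughness_along_direction(model, x, y))

In this kind of probe, the residual variant at matched depth would be expected to yield a smoother 1-D loss slice than the plain stack, which is consistent with the abstract's claim that residual connections reduce high-frequency components of the optimization landscape.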