Information Flow in Deep ReLU Networks
Reuse License: CC-BY-SA-4.0
Abstract
Deep learning has proven its effectiveness in large parts of the scientific world. Even large-scale applications, especially text-to-image or text-to-text models with billions of parameters, consist at their core of simple linear algebra, stacked and separated by non-linear functions. One such non-linear activation function, the Rectified Linear Unit (ReLU), is defined as the maximum of its argument and zero, effectively discretizing the input space into one of two cases: greater than or smaller than zero. These two mechanisms, a continuous basis (linear algebra) and a discrete choice (ReLU), appear sufficient to induce representations capable of tackling tasks such as autonomous driving or passing the Turing Test.
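Written out, the ReLU non-linearity and the binary case distinction it induces take the following form (a standard formulation, stated here for concreteness rather than quoted from the thesis):

\[
\operatorname{ReLU}(x) = \max(x, 0), \qquad a(x) = \mathbb{1}[x > 0] \in \{0, 1\},
\]

where a(x) is the discrete active/inactive state of a single unit; collecting these states over all units of a network yields its activation pattern.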
This thesis explores how information propagates during the training of deep ReLU networks, moving beyond the perspective of a purely continuous optimization process. By switching back and forth between these two views, a continuous and a discrete interpretation of the very same process, it addresses different instances of the same underlying question: How does information flow from the dataset, via the learning scheme, through a deep network? One way to answer this question is to observe which discrete decisions a deep network implicitly makes during training and inference. This leads to one of the key contributions of this work: examining activation patterns and their changes during training, which enables the analysis of architectural and optimization choices within a unified model of the training process. Building on these insights, the thesis introduces ActCooLR, a proof-of-concept learning rate scheduler based on the proposed transition model of activation pattern changes. A second way to approach the question is to adaptively augment the optimization process with additional discrete decisions, using a stochastic number system during training, and to monitor how the optimization copes with this increased difficulty.
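As an illustration of the kind of quantity involved, the sketch below computes a layer's activation pattern and the fraction of pattern entries that flip after a weight update. It is a minimal, assumed NumPy setup; the names activation_pattern and the random stand-in for a gradient step are purely illustrative and do not reproduce the thesis's actual transition model or the ActCooLR scheduler.

import numpy as np

# Illustrative only: one hidden ReLU layer; the "activation pattern" is the
# binary mask of which units are active (pre-activation > 0) for each input.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))          # a small batch of inputs
W = rng.normal(size=(8, 16)) * 0.1    # hidden-layer weights
b = np.zeros(16)                      # hidden-layer biases

def activation_pattern(X, W, b):
    """Return the binary mask of ReLU units that fire for each input."""
    return (X @ W + b) > 0.0

# Compare patterns before and after a (stand-in) parameter update.
pattern_before = activation_pattern(X, W, b)
W_updated = W + rng.normal(size=W.shape) * 0.01   # placeholder for a gradient step
pattern_after = activation_pattern(X, W_updated, b)

# Fraction of pattern entries that changed, i.e. units whose discrete
# active/inactive decision flipped between the two steps.
transitions = np.mean(pattern_before != pattern_after)
print(f"fraction of activation-pattern entries that changed: {transitions:.4f}")

Tracking such a transition rate over the course of training is one simple way to make the discrete side of the optimization process observable alongside the continuous loss curve.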
