26 February 2025


Artificial deep neural networks (NNs) are computational models that have attracted massive interest over the last decade. NNs are usually composed of many simple units (neurons) organised in multiple layers that interact nonlinearly with each other via trainable parameters, also called weights. Roughly speaking, training an NN to solve a task means adapting its parameters to fit an unknown function mapping raw inputs to output labels, based on a set of data points from which the model must generalise.
Although several algorithms exist for training NN models, the backpropagation algorithm with stochastic gradient descent (or variants thereof) has become the standard for supervised learning. The computation involves the sequential product of many matrices whose resulting norm, much like the product of many real numbers, can shrink towards zero or grow towards infinity exponentially with depth. This problem is known in the literature as the vanishing/exploding (V/E) gradient issue. The V/E issue was first recognised in the context of recurrent neural networks (RNNs) in the 1990s and, with the advent of increasingly deep architectures, it also became evident in feedforward neural networks (FNNs).
However, a clear solution to the V/E issue has remained elusive. In this work, EMERGE partners from the University of Pisa propose a method that solves the V/E gradient issue of deep learning models trained via stochastic gradient descent (SGD) methods.
Instead of adding skip connections between layers, the idea is to filter the previous activations orthogonally and combine them with the nonlinear activations of the next layer, forming a convex combination of the two. Remarkably, analytical bounds demonstrate that the gradient updates can neither vanish nor explode, and these bounds hold even in the infinite-depth case. The effectiveness of the method is demonstrated empirically by training, via backpropagation, an extremely deep multilayer perceptron (MLP) of 50k layers, and an Elman NN that learns long-term dependencies reaching 10k time steps into the past. Compared with other architectures specifically devised to deal with the V/E problem, e.g., LSTMs, the proposed model is far simpler yet more effective. Surprisingly, a single-layer vanilla recurrent NN (RNN) can be enhanced to reach state-of-the-art performance while converging remarkably fast; for instance, on the psMNIST task it reaches a test accuracy of over 94% in the first epoch, and over 98% after just ten epochs.
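As a rough illustration of the layer update described above, the following minimal NumPy sketch shows one possible reading: each layer outputs a convex combination of a nonlinear transformation of the previous activation and an orthogonally transformed copy of that same activation. The names (alpha, Q, layer_step) and the choice of tanh are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # QR decomposition of a random Gaussian matrix yields an orthogonal Q
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def layer_step(h_prev, W, b, Q, alpha):
    """Hypothetical layer update: convex combination of the new nonlinear
    activation and an orthogonally filtered copy of the previous activation
    (an illustrative sketch of the idea, not the authors' exact formulation)."""
    nonlinear = np.tanh(W @ h_prev + b)   # standard nonlinear branch
    linear = Q @ h_prev                   # orthogonal "filter" of the previous activation
    return alpha * nonlinear + (1.0 - alpha) * linear

# Toy forward pass through a very deep stack of such layers
d, depth = 16, 1000
h = rng.standard_normal(d)
for _ in range(depth):
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    b = np.zeros(d)
    Q = random_orthogonal(d)
    h = layer_step(h, W, b, Q, alpha=0.1)

print("activation norm after", depth, "layers:", np.linalg.norm(h))
```

Because the orthogonal branch preserves norms exactly, the combined update keeps the activation (and, by a similar argument, the backpropagated gradient) from collapsing to zero or blowing up, even across a very deep stack.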
Read the paper at the link below.