A. Ceni, "Random Orthogonal Additive Filters: A Solution to the Vanishing/Exploding Gradient of Deep Neural Networks," in IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2025.3538924.
Abstract: Since the recognition in the early 1990s of the vanishing/exploding (V/E) gradient issue plaguing the training of neural networks (NNs), significant efforts have been exerted to overcome this obstacle. However, a clear solution to the V/E issue has remained elusive. The pursuit of approximate dynamical isometry, i.e., parameter configurations where the singular values of the input–output Jacobian (IOJ) are tightly distributed around 1, leads to the derivation of an NN architecture that shares common traits with the popular residual network (ResNet) model. Instead of using skip connections between layers, the idea is to filter the previous activations orthogonally and add them to the nonlinear activations of the next layer, realizing a convex combination of the two. Remarkably, analytical bounds demonstrate that the gradient updates can neither vanish nor explode, and these bounds hold even in the infinite-depth case. The effectiveness of this method is demonstrated empirically by training, via backpropagation, an extremely deep multilayer perceptron (MLP) of 50k layers, and an Elman NN that learns long-term dependencies reaching 10k time steps into the past. Compared with other architectures specifically devised to deal with the V/E problem, e.g., LSTMs, the proposed model is far simpler yet more effective. Surprisingly, a single-layer vanilla recurrent NN (RNN) can be enhanced to reach state-of-the-art performance while converging remarkably fast; for instance, on the psMNIST task, it is possible to reach a test accuracy of over 94% in the first epoch and over 98% after just ten epochs.
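The layer update described in the abstract can be illustrated with the following minimal sketch, assuming a fixed random orthogonal filter Q, a tanh nonlinearity, and a scalar convex-combination weight alpha; these names and parameter choices are illustrative assumptions based only on the abstract, not the paper's exact parameterization.

import torch
import torch.nn as nn

class OrthogonalAdditiveLayer(nn.Module):
    # Sketch of one layer as described in the abstract: the previous activation
    # is passed through a fixed random orthogonal filter Q and convexly combined
    # with the next layer's nonlinear activation.
    # Q, alpha, and the tanh nonlinearity are illustrative assumptions.
    def __init__(self, dim, alpha=0.5):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        q, _ = torch.linalg.qr(torch.randn(dim, dim))  # random orthogonal matrix
        self.register_buffer("Q", q)                   # kept fixed (assumption: not trained)
        self.alpha = alpha                             # convex-combination weight

    def forward(self, h):
        # h_next = alpha * Q h + (1 - alpha) * phi(W h + b)
        return self.alpha * (h @ self.Q.T) + (1.0 - self.alpha) * torch.tanh(self.linear(h))

# Usage: stacking many such layers into a very deep MLP.
depth, dim = 100, 64
net = nn.Sequential(*[OrthogonalAdditiveLayer(dim) for _ in range(depth)])
x = torch.randn(8, dim)
y = net(x)  # output shape (8, 64)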