As noted previously, in earlier research it was common to initialize weights in a neural network with some range of random values. Breakthroughs in the training of Deep Belief Networks in 2006, as you will see in Chapter 4, Teaching Networks to Generate Digits, used pre-training (through a generative modeling approach) to initialize weights before performing standard backpropagation.
If you've ever used a layer in the TensorFlow Keras module, you will notice that the default initialization for layer weights draws from either a truncated normal or uniform distribution. Where does this choice come from? As I described previously, one of the challenges with deep networks using sigmoidal or hyperbolic tangent activation functions is that they tend to become saturated, since these functions flatten out for very large positive or negative inputs. We might then interpret the challenge of initializing networks as keeping the weights in a range where they don't saturate the neuron's output. Another way to frame this is to require that the input and output values of the neuron have similar variance, so that the signal is neither massively amplified nor diminished while passing through the neuron.
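To make the saturation problem concrete, here is a minimal sketch that evaluates the derivatives of the sigmoid and hyperbolic tangent functions at a few inputs. The helper names (`sigmoid_grad`, `tanh_grad`) and the sample inputs are my own choices for illustration, not part of any library; the point is only that the gradient shrinks toward zero as the input moves away from the origin.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


# Illustrative inputs: the further from zero, the flatter both functions are.
for z in (0.0, 2.0, 5.0, 10.0, -10.0):
    print(f"z={z:6.1f}  sigmoid'={sigmoid_grad(z):.6f}  tanh'={tanh_grad(z):.6f}")
```

Because backpropagation multiplies these derivatives together layer by layer, a neuron whose input is far from zero passes almost no gradient back, which is why we want the initial weights to keep inputs in the region where the slope is still appreciable.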
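In practice, for a linear neuron, y = wx + b, we could compute the variance of the output in terms of the input as:
$$\mathrm{var}(y) = \mathrm{var}(wx + b)$$
The b is constant, so it contributes no variance. Treating w and x as independent, we are left with:
$$\mathrm{var}(y) = \mathrm{var}(w)\,\mathrm{var}(x) + \mathrm{var}(w)\,E(x)^2 + \mathrm{var}(x)\,E(w)^2 = \mathrm{var}(w)\,\mathrm{var}(x)$$
where the last equality assumes that both w and x have zero mean. Since the neuron sums over N such terms, one for each of its N inputs, the output variance is N var(w) var(x). We want var(y) to equal var(x), which gives:
$$1 = N\,\mathrm{var}(w), \qquad \mathrm{var}(w) = \frac{1}{N}$$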
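The result var(w) = 1/N can be checked numerically. The following is a rough simulation rather than a proof: it assumes zero-mean Gaussian inputs with unit variance and zero-mean Gaussian weights, both redrawn for every sample, and the function name `output_variance` and the value of N are hypothetical choices for illustration.

```python
import random
import statistics


def output_variance(n_inputs, weight_var, n_samples=2000, seed=0):
    """Estimate var(y) for y = sum_i w_i * x_i with zero-mean w and x.

    Assumes unit-variance Gaussian inputs and Gaussian weights with
    variance `weight_var`; both are redrawn for every sample.
    """
    rng = random.Random(seed)
    weight_std = weight_var ** 0.5
    outputs = []
    for _ in range(n_samples):
        y = sum(
            rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0)
            for _ in range(n_inputs)
        )
        outputs.append(y)
    return statistics.pvariance(outputs)


n = 100  # hypothetical number of inputs to the neuron
print("var(w) = 1/N:", output_variance(n, 1.0 / n))  # should be close to 1
print("var(w) = 1  :", output_variance(n, 1.0))      # should be close to N
```

With the weights scaled to variance 1/N, the output variance stays near that of the input, whereas unit-variance weights amplify it by roughly a factor of N.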
Therefore, for a weight matrix w, we can use a truncated normal or uniform distribution with variance 1/N, where N is taken to be the average of the number of input and output units of the layer. Variations have also been applied to ReLU units, where the variance is typically scaled to 2/N, with N being the number of input units. These methods are referred to by their original authors' names as Xavier (also known as Glorot) or He initialization.
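As a sketch of how these schemes translate into concrete numbers, the helpers below compute the scale parameters for Xavier (Glorot) uniform and He normal initialization from a layer's input and output sizes. The function names and layer sizes here are hypothetical and written in plain Python for illustration; in Keras, the corresponding built-in initializers are `tf.keras.initializers.GlorotUniform` and `tf.keras.initializers.HeNormal`.

```python
import math
import random
import statistics


def glorot_uniform_limit(fan_in, fan_out):
    """Bound for Uniform(-limit, limit) with variance 2 / (fan_in + fan_out).

    A uniform distribution on (-a, a) has variance a**2 / 3, so
    a = sqrt(6 / (fan_in + fan_out)).
    """
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_normal_stddev(fan_in):
    """Standard deviation for He initialization: variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def sample_glorot_uniform(fan_in, fan_out, seed=0):
    """Draw a fan_in x fan_out weight matrix as a nested list."""
    rng = random.Random(seed)
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]


# Hypothetical layer sizes, chosen only for illustration.
weights = sample_glorot_uniform(fan_in=50, fan_out=30)
flat = [v for row in weights for v in row]
print("target variance:", 2.0 / (50 + 30))
print("sample variance:", statistics.pvariance(flat))
print("He stddev for fan_in=50:", he_normal_stddev(50))
```

Note that 2 / (fan_in + fan_out) is the same quantity as 1/N when N is the average of the input and output sizes, so the Glorot rule is the derivation above applied to both the forward and backward pass.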
In summary, we've reviewed several common optimizers used under the hood in TensorFlow 2, and discussed how they improve upon the basic form of SGD. We've also discussed how clever weight initialization schemes work together with these optimizers to allow us to train ever more complex models.