Although large-scale research funding for neural networks declined after the publication of Perceptrons and did not recover until the 1980s, researchers still recognized that these models had value, particularly when assembled into multi-layer networks, each layer composed of several perceptron units. Indeed, when the mathematical form of the output function (that is, the output of the model) was relaxed to take on many forms (such as a linear function or a sigmoid), these networks could solve both regression and classification problems, with theoretical results showing that 3-layer networks could effectively approximate any output. However, none of this work addressed the practical limitations of computing the solutions to these models, and rules such as the perceptron learning algorithm described earlier proved a serious limitation on their applied use.
Renewed interest in neural networks came with the popularization of the backpropagation algorithm, which, while discovered in the 1960s, was not widely applied to neural networks until the 1980s, following several studies highlighting its usefulness for learning the weights in these models. As you saw with the perceptron model, a learning rule to update weights is relatively easy to derive as long as there are no "hidden" layers. The input is transformed once by the perceptron to compute an output value, meaning the weights can be directly tuned to yield the desired output. When there are hidden layers between the input and output, the problem becomes more complex: how do we change the internal weights to compute the activations that feed into the final output? How do we modify them in relation to the input weights?
The insight of the backpropagation technique is that we can use the chain rule from calculus to efficiently compute the derivative of a loss function with respect to each parameter of a network; combined with a learning rule, this provides a scalable way to train multilayer networks.
Let's illustrate backpropagation with an example: consider a network like the one shown in Figure 3.3. Assume that the output in the final layer is computed using a sigmoidal function, which yields a value between 0 and 1:
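Assuming the standard logistic form of the sigmoid, and writing x for the total input to the final neuron and ŷ for its output (symbols chosen here to match the surrounding text), this relationship can be written as:

```latex
\[
\hat{y} = \frac{1}{1 + e^{-x}}
\]
```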
Furthermore, the value x, the total input to the final neuron, is a weighted sum of the sigmoidal outputs of the hidden units:
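One way to write this weighted sum, assuming five hidden units as in the example, is shown below. The symbols h_i (output of hidden unit i), w_i (weight from hidden unit i to the final neuron), and b (bias of the final neuron) are notation assumed here, with h_i used to keep the hidden outputs distinct from the target value:

```latex
\[
x = \sum_{i=1}^{5} w_i \, h_i + b
\]
```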
We also need a notion of when the network is performing well or badly at its task. A straightforward error function to use here is squared loss:
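A common form of the squared loss, consistent with the description that follows, is sketched below. The factor of one half is an assumed convention that cancels on differentiation, and the indices j (examples) and k (outputs) are assumed names:

```latex
\[
E = \frac{1}{2} \sum_{j=1}^{J} \sum_{k=1}^{K} \left( \hat{y}_{jk} - y_{jk} \right)^{2}
\]
```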
where ŷ is the estimated value (from the output of the model) and y is the real value, summed over all the input examples J and the outputs of the network K (where K=1, since there is only a single output value). Backpropagation begins with a "forward pass" where we compute the values of all the outputs in the inner and outer layers, to obtain the estimated values of ŷ. We then proceed with a backward step to compute gradients to update the weights.
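The forward pass and loss described so far can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the input dimension, the toy data, the random initial weights, and all function and variable names (`forward`, `squared_loss`, `W1`, `w2`, and so on) are assumptions made for the example, with five hidden units and one output as in the text.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W1, b1, w2, b2):
    """Forward pass for one hidden layer and a single sigmoid output.

    X: (n, d) inputs; W1: (d, m) hidden weights; b1: (m,) hidden biases;
    w2: (m,) output weights; b2: scalar output bias.
    """
    h = sigmoid(X @ W1 + b1)   # hidden-unit outputs, shape (n, m)
    x_out = h @ w2 + b2        # total input to the final neuron, shape (n,)
    y_hat = sigmoid(x_out)     # network output, shape (n,)
    return h, x_out, y_hat

def squared_loss(y_hat, y):
    """Squared loss summed over examples (with the conventional 1/2 factor)."""
    return 0.5 * np.sum((y_hat - y) ** 2)

# Hypothetical toy data: 4 examples, 3 input features, 5 hidden units.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([0.0, 1.0, 1.0, 0.0])
W1 = rng.normal(scale=0.1, size=(3, 5))
b1 = np.zeros(5)
w2 = rng.normal(scale=0.1, size=5)
b2 = 0.0

h, x_out, y_hat = forward(X, W1, b1, w2, b2)
loss = squared_loss(y_hat, y)
print("outputs:", y_hat, "loss:", loss)
```

Keeping the hidden outputs `h` from the forward pass matters because the backward step reuses them when computing gradients.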
Our overall objective is to compute partial derivatives for the weights w and bias terms b in each neuron: ∂E/∂w and ∂E/∂b, which will allow us to compute the updates for b and w. Towards this goal, let's start by computing the update rule for the inputs in the final neuron; we want to calculate the partial derivative of the error E with respect to each of these inputs (in this example there are five, corresponding to the five hidden layer neurons), using the chain rule:
The first factor in this chain comes from differentiating the squared loss, which for an individual example is just the difference between the estimated and real values, ŷ - y. For the second factor, we need to take the derivative of the sigmoid function:
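Written for the total input x of the final neuron, with ŷ its sigmoid output (assumed symbols), the chain rule decomposition would take this form:

```latex
\[
\frac{\partial E}{\partial x}
  = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial x}
\]
```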
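For the logistic sigmoid, the derivative can be expressed in terms of the output itself. A short derivation, with x the input and ŷ the output of the final neuron (assumed symbols), is:

```latex
\[
\frac{\partial \hat{y}}{\partial x}
  = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
  = \frac{1}{1 + e^{-x}} \cdot \left( 1 - \frac{1}{1 + e^{-x}} \right)
  = \hat{y} \left( 1 - \hat{y} \right)
\]
```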
Putting it all together, we have:
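For a single example with squared loss and a sigmoid output, and under the assumed notation of x for the final neuron's total input, ŷ for its output, and y for the real value, the combined expression would be:

```latex
\[
\frac{\partial E}{\partial x}
  = \left( \hat{y} - y \right) \hat{y} \left( 1 - \hat{y} \right)
\]
```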
If we want to compute the gradient for a particular parameter of x, such as a weight w or bias term b, we need one more step:
We already know the first term, and since x is a linear function of w, its derivative with respect to w is simply the corresponding input y from the lower layer, so we obtain:
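For a weight w_i on the connection from hidden unit i into the final neuron (the subscript is assumed notation), this extra chain rule step would read:

```latex
\[
\frac{\partial E}{\partial w_i}
  = \frac{\partial E}{\partial x} \cdot \frac{\partial x}{\partial w_i}
\]
```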
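Writing h_i for the output of hidden unit i that feeds the final neuron (an assumed symbol for the lower-layer value, to keep it distinct from the target y), the result for a weight would be as follows. The bias case is added here for completeness, since x depends on b with coefficient 1:

```latex
\[
\frac{\partial E}{\partial w_i}
  = \frac{\partial E}{\partial x} \cdot h_i
  = \left( \hat{y} - y \right) \hat{y} \left( 1 - \hat{y} \right) h_i ,
\qquad
\frac{\partial E}{\partial b} = \frac{\partial E}{\partial x}
\]
```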
If we want to compute this derivative for one of the neurons in the hidden layer, we likewise take the partial derivative with respect to this input y, which is simply:
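Since x is linear in the hidden outputs, this derivative is just the connecting weight. With h_i the output of hidden unit i and w_i its weight into the final neuron (assumed symbols):

```latex
\[
\frac{\partial x}{\partial h_i} = w_i
\]
```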
So, in total, we sum over all units in the next layer that this hidden unit feeds into:
We can repeat this process recursively for any units in deeper layers to obtain the desired update rule, since we now know how to calculate the gradients for y or w at any layer. This makes the process of updating weights efficient: once we have computed the gradients through the backward pass, we can combine consecutive gradients through the layers to get the required gradient at any depth of the network.
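In general form, with h_i the output of hidden unit i, x_k the total input of unit k in the layer above, and w_{ki} the weight from unit i to unit k (all assumed notation), the sum would be written as below. In this example there is a single output unit, so the sum has only one term:

```latex
\[
\frac{\partial E}{\partial h_i}
  = \sum_{k} \frac{\partial E}{\partial x_k} \cdot w_{ki}
\]
```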
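The backward pass can be sketched in NumPy for a network with one hidden layer and a single sigmoid output, followed by a finite-difference check on one weight. This is an illustrative sketch, not the authors' code: the function name `loss_and_grads`, the variable names, the toy data, and the layer sizes are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_and_grads(X, y, W1, b1, w2, b2):
    """Squared loss and its gradients for one hidden layer, one output."""
    # Forward pass: keep the intermediate values needed for the backward pass.
    h = sigmoid(X @ W1 + b1)          # hidden outputs, (n, m)
    y_hat = sigmoid(h @ w2 + b2)      # network outputs, (n,)
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass, output neuron: dE/dx = (y_hat - y) * y_hat * (1 - y_hat).
    dE_dx_out = (y_hat - y) * y_hat * (1.0 - y_hat)   # (n,)
    grad_w2 = h.T @ dE_dx_out         # dE/dw_i = sum over examples of dE/dx * h_i
    grad_b2 = dE_dx_out.sum()

    # Backward pass, hidden layer: dE/dh_i = dE/dx * w_i, then through the sigmoid.
    dE_dh = np.outer(dE_dx_out, w2)   # (n, m)
    dE_dx_hid = dE_dh * h * (1.0 - h) # (n, m)
    grad_W1 = X.T @ dE_dx_hid         # (d, m)
    grad_b1 = dE_dx_hid.sum(axis=0)   # (m,)
    return loss, (grad_W1, grad_b1, grad_w2, grad_b2)

# Hypothetical toy problem: 6 examples, 3 inputs, 5 hidden units.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
y = rng.integers(0, 2, size=6).astype(float)
W1 = rng.normal(scale=0.5, size=(3, 5))
b1 = np.zeros(5)
w2 = rng.normal(scale=0.5, size=5)
b2 = 0.0

loss, (grad_W1, grad_b1, grad_w2, grad_b2) = loss_and_grads(X, y, W1, b1, w2, b2)

# Finite-difference check on a single hidden-layer weight.
delta = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += delta
W1_minus[0, 0] -= delta
numeric = (loss_and_grads(X, y, W1_plus, b1, w2, b2)[0]
           - loss_and_grads(X, y, W1_minus, b1, w2, b2)[0]) / (2 * delta)
print("analytic:", grad_W1[0, 0], "numeric:", numeric)
```

Comparing an analytic gradient against a central finite difference is a common sanity check when writing a backward pass by hand.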
Now that we have the gradients for each w (or other parameter of the neuron we might want to calculate), how can we make a "learning rule" to update the weight? In their paper, Hinton et al. noted that we could apply an update to the model parameters after computing gradients on each sample batch but suggested instead applying an update calculated after averaging over all samples. The gradient points in the direction in which the error function increases most steeply with respect to the parameters; thus, to update, we want to push the weight in the opposite direction, with Δw the update and ε a small value (a step size):
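In symbols, with Δw denoting the update and ε the step size (the Greek letters are assumed notation for the quantities just described), the basic gradient descent rule would be:

```latex
\[
\Delta w = -\varepsilon \, \frac{\partial E}{\partial w}
\]
```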
Then at each time t during training we update the weight using this calculated gradient:
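A form of this update consistent with the decay parameter described next, assuming Δw(t) denotes the update applied at step t, ε the step size, and α the weight on the previous update, is:

```latex
\[
\Delta w(t) = -\varepsilon \, \frac{\partial E}{\partial w(t)} + \alpha \, \Delta w(t-1)
\]
```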
where α is a decay parameter, ranging from 0 to 1, that weights the contribution of prior updates. Following this procedure, we would initialize the weights in the network with some small random values, choose a step size ε, and iterate with forward and backward passes, along with updates to the parameters, until the loss function reaches some desired value.
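The whole procedure can be sketched as a short NumPy training loop. This is a minimal illustration under stated assumptions: the toy dataset, the layer sizes, the values of `epsilon` and `alpha`, the fixed number of iterations (used instead of a loss threshold), and every name in the code are hypothetical choices. Gradients are averaged over all samples, following the suggestion mentioned earlier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_loss_and_grads(X, y, params):
    """Mean squared loss and gradients, averaged over all samples."""
    W1, b1, w2, b2 = params
    n = X.shape[0]
    h = sigmoid(X @ W1 + b1)                       # forward: hidden outputs
    y_hat = sigmoid(h @ w2 + b2)                   # forward: network output
    loss = 0.5 * np.mean((y_hat - y) ** 2)
    d_out = (y_hat - y) * y_hat * (1.0 - y_hat) / n   # backward: output neuron
    d_hid = np.outer(d_out, w2) * h * (1.0 - h)       # backward: hidden layer
    grads = [X.T @ d_hid, d_hid.sum(axis=0), h.T @ d_out, np.array([d_out.sum()])]
    return loss, grads

# Hypothetical toy task: label is 1 when the two features sum to a positive value.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Small random initial weights; 2 inputs, 5 hidden units, 1 output.
params = [rng.normal(scale=0.1, size=(2, 5)), np.zeros(5),
          rng.normal(scale=0.1, size=5), np.zeros(1)]
updates = [np.zeros_like(p) for p in params]       # previous updates, initially zero

epsilon, alpha = 0.2, 0.9                          # step size and decay (assumed values)
initial_loss, _ = mean_loss_and_grads(X, y, params)

for t in range(1000):
    _, grads = mean_loss_and_grads(X, y, params)
    # update(t) = -epsilon * gradient + alpha * update(t - 1)
    updates = [alpha * u - epsilon * g for u, g in zip(updates, grads)]
    params = [p + u for p, u in zip(params, updates)]

final_loss, _ = mean_loss_and_grads(X, y, params)
print("initial loss:", initial_loss, "final loss:", final_loss)
```

Setting `alpha` to 0 reduces the loop to plain gradient descent, which makes it easy to compare the effect of the decay term on the same data.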
Now that we have described the formal mathematics behind backpropagation, let us look at how it is implemented in practice in software packages such as TensorFlow 2.