As we saw in our discussion of back propagation, the original version proposed in 1986 for training neural networks averaged the loss over the entire dataset before taking the gradient and updating the weights. Obviously, this is quite slow and makes distributing the model difficult, as we can't split up the input data and model replicas; if we use them, each needs to have access to the whole dataset.
In contrast, SGD computes gradient updates after n samples, where n could a range from 1 to N, the size of the dataset. In practice, we usually perform mini-batch gradient descent, in which n is relatively small, and we randomize assignment of data to the n batches after each epoch (a single pass through the data).
However, SGD can be slow, leading researchers to propose alternatives that accelerate the each for a minimum. As seen in the original backpropagation algorithm, one idea is to use a form of exponentially weighted mementum that remembers prior steps and continues in promising directions. Variants have been proposed, such as Nesterou Momentum, which adds a term to increase this acceleration.
In comparison to the momentum term used in the original backpropagation algorithm, the addition of the current momentum term to the gradient helps keep the momentum component aligned with the gradient changes.
Another optimization, termed Adaptive Gradient(Adagrad), scales the learning rate for each update by the running the sum of squares(G) of the gradient of that parameter; thus, elements that are frequently updated are downsampled, while those that are infrequently updated are pushed to update with greater magnitude:
This approach has the downside that as we continue to train the neural network, the sum G will increase inndefinitely, ultimately sharinking the learning rate to a very samll value. To fix this shortcoming, two variant methods, RMSProp (frequently applied to RNNs) and AdaDelta impose fixed-width windows of n steps in the computation of G.
Adaptive Momentum Estimation (ADAM) can be seen as an attempt to combine momentum and AdaDelta; the momentum claculation is used to preserve the history of past gradient updates, while the sum of decaying squared gradients within a fixed update window used in AdaDelta is applied to scale the resulting gradient.
The methods mentioned here all share the property of being first order; they involve only the first derivative of the loss with respect to the input. While simple to compute, this may introduce partical challenges with navigating the complex solution space of neural network parameters. As shown in Figure 3.17, if we visualize the landscape of weight parameters as a ravine, then first-order methods will either move too quickly in areas in which the curvature is changing quickly (the top image) overshooting the minima, or will change too slowly within the minima "ravine," wher ethe curvature is low. An ideal algorithm would take into account not only the curvature but the rate of change of the curvature, allowing an optimizer order method to take larger step sizes when the curvature changes very slowly and vice versa(the bottom image).
Because they make use of the rate of change of the derivative (the second derivative), these methods are known as second order, and have demonstrated some success in optimizing neural network models.
However, the computation required for each update is larger than for first-order methods, and because most second-order methods iunvolve large matrix inversions (and thus memory utilization), approximations are required to make these methods scale. Ultimately, however, one of the breakthroughs in practically optimizing networks comes not just from the optimization algorithm, but how we initialize the weights in the model.