Page

Tuesday, March 1, 2022

Xavier initialization

 As noted previously, in earlier research it was common to initialize weights in a neural network with some range of random values. Breakthroughs in the training of Deep Belief Networks in 2006, as you will see in Chapter 4, Teaching Networks to Generate Digits, used pre-training (through a generative modeling approach) to initialize weights before performing standard backpropagation.

If you've ever used a layer in the TensorFlow Keras module, you will notice that the default initialization for layer weights draws from either a truncated normal or uniform distribution. Where does this choice come from? As I described previously, one of the challenges with deep networks using sigmoidal or hyperbolic activation functions is that they tend to become saturated, since the values of these functions are nearly flat for very large positive or negative inputs. We might then interpret the challenge of initializing networks as keeping weights in such a range that they don't saturate the neuron's output. Another way to understand this is to assume that the input and output values of the neuron have similar variance; the signal is then neither massively amplified nor diminished while passing through the neuron.

In practice, for a linear neuron, y = wx + b, we could compute the variance of the input and output as:

var(y) = var(wx + b)

Since b is constant, we are left with:

var(y) = var(w)var(x) + var(w)E(x)^2 + var(x)E(w)^2

If we assume that both the weights and the inputs have zero mean, the last two terms vanish, leaving var(y) = var(w)var(x).

Since the neuron actually sums over N such input elements (one per weight), var(y) = Nvar(w)var(x). Setting var(y) equal to var(x) gives:

1 = Nvar(w), var(w) = 1/N

Therefore, for a weight matrix w, we can use a truncated normal or uniform distribution with variance 1/N, where N is taken in practice as the average of the number of input and output units. Variations of this scheme have also been derived for ReLU units. These methods are referred to by their original authors' names as Xavier (Glorot) or He initialization.
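
As a quick illustration, these schemes can be selected explicitly through the Keras initializer API (a minimal sketch; the layer sizes and activations are arbitrary):

import tensorflow as tf

# Glorot (Xavier) uniform is the Keras default for Dense layers;
# He initialization is often preferred with ReLU activations.
xavier_layer = tf.keras.layers.Dense(
    64, activation="tanh",
    kernel_initializer=tf.keras.initializers.GlorotUniform())
he_layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_initializer=tf.keras.initializers.HeNormal())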

In summary, we've reviewed several common optimizers used under the hood in TensorFlow 2, and discussed how they improve upon the basic form of SGD. We've also discussed how clever weight initialization schemes work together with these optimizers to allow us to train ever more complex models.


Gradient descent to ADAM

 As we saw in our discussion of backpropagation, the original version proposed in 1986 for training neural networks averaged the loss over the entire dataset before taking the gradient and updating the weights. Obviously, this is quite slow and makes distributing the computation difficult, as we can't split up the input data across model replicas; if we use replicas, each one needs access to the whole dataset.

In contrast, SGD computes gradient updates after n samples, where n could range from 1 to N, the size of the dataset. In practice, we usually perform mini-batch gradient descent, in which n is relatively small, and we randomize the assignment of data to batches after each epoch (a single pass through the data).
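
A minimal sketch of mini-batch SGD for a linear model with a mean squared error loss (NumPy only; the learning rate, batch size, and epoch count are illustrative):

import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))        # reshuffle the data each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]  # one mini-batch of n samples
            error = X[batch] @ w - y[batch]
            grad = X[batch].T @ error / len(batch) # gradient of the MSE loss
            w -= lr * grad
    return w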

However, SGD can be slow to converge, leading researchers to propose alternatives that accelerate the search for a minimum. As seen in the original backpropagation algorithm, one idea is to use a form of exponentially weighted momentum that remembers prior steps and continues in promising directions. Variants have been proposed, such as Nesterov momentum, which adds a term to increase this acceleration.

In comparison to the momentum term used in the original backpropagation algorithm, adding the current momentum term to the gradient calculation helps keep the momentum component aligned with changes in the gradient.
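
A sketch of the classical momentum update (variable names are illustrative; the Keras call at the end shows how both variants are exposed in TensorFlow 2):

def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    # The velocity v remembers prior steps and keeps moving in directions
    # that have been consistently promising.
    v = mu * v - lr * grad
    return w + v, v

# In Keras: tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)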

Another optimization, termed Adaptive Gradient (AdaGrad), scales the learning rate for each parameter by the running sum of squares (G) of that parameter's gradients; thus, elements that are frequently updated are downweighted, while those that are infrequently updated are pushed to update with greater magnitude:
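
A minimal sketch of the AdaGrad update for a single parameter vector (NumPy; eps is a small constant for numerical stability):

import numpy as np

def adagrad_step(w, G, grad, lr=0.01, eps=1e-8):
    G = G + grad ** 2                        # running sum of squared gradients
    w = w - lr * grad / (np.sqrt(G) + eps)   # frequently updated parameters take smaller steps
    return w, G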

This approach has the downside that, as we continue to train the neural network, the sum G will increase indefinitely, ultimately shrinking the learning rate to a very small value. To fix this shortcoming, two variant methods, RMSProp (frequently applied to RNNs) and AdaDelta, impose fixed-width windows of n steps in the computation of G.
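
RMSProp, for example, replaces the unbounded sum with an exponentially decaying average of squared gradients; a sketch of this variant (NumPy; rho is the decay rate, typically around 0.9):

import numpy as np

def rmsprop_step(w, G, grad, lr=0.001, rho=0.9, eps=1e-8):
    G = rho * G + (1 - rho) * grad ** 2      # decaying average instead of an ever-growing sum
    w = w - lr * grad / (np.sqrt(G) + eps)
    return w, G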


Adaptive Moment Estimation (ADAM) can be seen as an attempt to combine momentum and AdaDelta: the momentum calculation is used to preserve the history of past gradient updates, while the decaying sum of squared gradients within a fixed update window, as used in AdaDelta, is applied to scale the resulting gradient.
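
A sketch of the standard ADAM update (default hyperparameters follow the original paper; t is the step count, starting at 1):

import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # momentum-like first moment of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # decaying second moment, as in RMSProp/AdaDelta
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v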

The methods mentioned here all share the property of being first order; they involve only the first derivative of the loss with respect to the parameters. While simple to compute, this may introduce practical challenges with navigating the complex solution space of neural network parameters. As shown in Figure 3.17, if we visualize the landscape of weight parameters as a ravine, then first-order methods will either move too quickly in areas in which the curvature is changing quickly (the top image), overshooting the minima, or will change too slowly within the minima "ravine," where the curvature is low. An ideal algorithm would take into account not only the curvature but the rate of change of the curvature, allowing the optimizer to take larger step sizes when the curvature changes very slowly and vice versa (the bottom image).

Because they make use of the rate of change of the derivative (the second derivative), these methods are known as second order, and have demonstrated some success in optimizing neural network models.

However, the computation required for each update is larger than for first-order methods, and because most second-order methods involve large matrix inversions (and thus memory utilization), approximations are required to make these methods scale. Ultimately, however, one of the breakthroughs in practically optimizing networks comes not just from the optimization algorithm, but from how we initialize the weights in the model.





Monday, February 28, 2022

Building a better optimizer

 In this chapter we have so far discussed several examples in which better neural network architectures allowed for breakthroughs; however, just as (and perhaps even more) important is the optimization procedure used to minimize the error function in these problems, which "learns" the parameters of the network by selecting those that yield the lowest error. Referring to our discussion of backpropagation, this problem has two components:

- How to initialize the weights: In many applications historically, we see that the authors used random weights within some range, and hoped that the use of backpropagation would result in at least a locally minimal loss function from this random starting point.

- How to find the local minimum loss: In basic backpropagation, we used gradient descent with a fixed learning rate and first-derivative updates to traverse the potential solution space of weight matrices; however, there is good reason to believe there might be more efficient ways to find a local minimum.

In fact, both of these have turned out to be key considerations towards progress in deep learning research.

RNNs and LSTMs

 Let's imagine we are trying to predict the next word in a sentence, given the words up until this point. A neural network that attempted to predict the next word would need to take into account not only the current word but a variable number of prior inputs. If we instead used only a simple feedforward MLP, the network would essentially process the entire sentence (or each word) as a fixed-length vector. This introduces the problem of either having to pad variable-length inputs to a common length while not preserving any notion of correlation (that is, which words in the sentence are more relevant than others in generating the next prediction), or only using the last word at each step as the input, which removes the context of the rest of the sentence and all the information it can provide. This kind of problem inspired the "vanilla" RNN, which incorporates not only the current input but the prior step's hidden state in computing a neuron's output:
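
In a standard formulation (the weight matrices W and U and the bias b here are generic names), the hidden state at time t is:

h(t) = tanh(W x(t) + U h(t-1) + b)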

One way to visualize this is to imagine each layer feeding recursively into the next timestep in a sequence. In effect, if we "unroll" each part of the sequence, we end up with a very deep neural network, where each layer shares the same weights.

The same difficulties that characterize training deep feedforward networks also apply to RNNs; gradients tend to die out over long distances using traditional activation functions (or explode if the gradients become greater than 1).

However, unlike feedforward networks, RNNs aren't trained with traditional backpropagation, but rather a variant known as backpropagation through time (BPTT): the network is unrolled, as before, and backpropagation is used, averaging over errors at each time point (since an "output," the hidden state, occurs at each step). Also, in the case of RNNs, we run into the problem that the network has a very short memory; it only incorporates information from the most recent unit before the current one and has trouble maintaining long-range context. For applications such as translation, this is clearly a problem, as the interpretation of a word at the end of a sentence may depend on terms near the beginning, not just those directly preceding it.

The LSTM network was developed to allow RNNs to maintain a context or state over long sequences.

In a vanilla RNN, we only maintain a short-term memory h coming from the prior step's hidden unit activations. In addition to this short-term memory, the LSTM architecture introduces an additional layer c, the "long-term" memory, which can persist over many timesteps. The design is in some ways reminiscent of an electrical capacitor, which can use the c layer to store up or hold "charge" and discharge it once it has reached some threshold. To compute these updates, an LSTM unit consists of a number of related neurons, or gates, that act together to transform the input at each time step.

Given an input vector x and the hidden state h from the previous time step t-1, at each time step an LSTM first computes a value from 0 to 1 for each element of c representing what fraction of information is "forgotten" from each element of the vector:
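
In the standard LSTM formulation (the weight and bias names here are generic), this forget gate is:

f(t) = sigmoid(W_f x(t) + U_f h(t-1) + b_f)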

We make a second, similar calculation to determine what part of the input value to preserve:
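
In the same standard notation, this input gate is:

i(t) = sigmoid(W_i x(t) + U_i h(t-1) + b_i)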

We now know which elements of c are updated; we can compute this update as follows:
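
In the standard formulation, the new long-term memory is:

c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ tanh(W_c x(t) + U_c h(t-1) + b_c)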

where ⊙ is a Hadamard product (element-wise multiplication). In essence, this equation tells us how to compute updates using the tanh transform, filter them using the input gate, and combine them with the prior time step's long-term memory using the forget gate to potentially filter out old values.


To compute the output at each time step, we compute another output gate:
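
In the standard notation used above, the output gate is:

o(t) = sigmoid(W_o x(t) + U_o h(t-1) + b_o)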

And to compute the final output at each step (the hidden layer fed as short-term memory to the next step) we have:
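
In the standard formulation, this is:

h(t) = o(t) ⊙ tanh(c(t))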

Many variants of this basic design have been proposed; for example, the "peephole" LSTM substitutes h(t-1) with c(t-1) (thus each operation gets to "peep" at the long-term memory cell), while the GRU simplifies the overall design by removing the output gate. What these designs all have in common is that they avoid the vanishing (or exploding) gradient difficulties seen during the training of RNNs, since the long-term memory acts as a buffer to maintain the gradient and propagate neuronal activations over many timesteps.


Networks for sequence data

 In addition to image data, natural language text has also been a frequent topic of interest in neural network research. However, unlike the datasets we've examined thus far, language has a distinct order that is important to its meaning. Thus, to accurately capture the patterns in language or time-dependent data, it is necessary to utilize networks designed for this purpose.

AlexNet architecture

 While the architecture of AlexNet shown in Figure 3.12 might look intimidating, it is not so difficult to understand once we break up this large model into individual processing steps. Let's start with the input images and trace how the output classification is computed for each image through a series of transformations performed by each subsequent layer of the neural network.

The input images to AlexNet are of size 224 * 224 * 3 (for RGB channels). The first layer consists of groups of 96 units and 11 * 11 * 3 kernels; the output is response normalized (as described previously) and max pooled. Max pooling is an operation that takes the maximum value over an n * n grid to register whether a pattern appeared anywhere in the input; this is again a form of positional invariance.

The second layer is also a set of kernels, of size 5 * 5 * 48, in groups of 256. The third through fifth hidden layers have additional convolutions, without normalization, followed by two fully connected layers and an output of size 1,000 representing the possible image classes in ImageNet. The authors of AlexNet used several GPUs to train the model, and this acceleration is important to making the training of such a large model practical.
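
A rough Keras sketch of this layer stack follows (the grouped convolutions and local response normalization of the original model are omitted, and the fully connected sizes of 4,096 follow the original paper; treat this as an approximation rather than a faithful reimplementation):

import tensorflow as tf

alexnet_like = tf.keras.Sequential([
    tf.keras.layers.Conv2D(96, 11, strides=4, activation="relu",
                           input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(256, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1000, activation="softmax"),  # 1,000 ImageNet classes
])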

Looking at the features learned during training in the initial 11 * 11 * 3 convolutions (Figure 3.13), we can see recognizable edges and colors. While the authors of AlexNet don't show examples of neurons higher in the network that synthesize these basic features, an illustration is provided by another study in which researchers trained a large CNN to classify images in YouTube videos, yielding a neuron in the upper reaches of the network that appeared to be a cat detector.

This overview should give you an idea of why CNN architectures look the way they do, and what developments have allowed them to become more tractable as the basis for image classifiers or image-based generative models over time. We will now turn to a second class of more specialized architectures, RNNs, which are used to develop time- or sequence-based models.


AlexNet and other CNN innovations

 A 2012 article that produced state-of-the-art results classifying the 1.3 million images in ImageNet into 1,000 classes using a model termed AlexNet demonstrates some of the later innovations that made training these kinds of models practical. One, as I've alluded to before, is the use of ReLUs in place of sigmoid or hyperbolic tangent functions. A ReLU is a function of the form:

y = max(0,x)

In contrast to the sigmoid function or tanh, in which the derivative shrinks to 0 as the function saturates, the ReLU function has a constant gradient and a discontinuity at 0 (Figure 3.10). This means that the gradient does not saturate, which would otherwise cause deeper layers of the network to train more slowly, leading to intractable optimization.

While advantageous due to their non-vanishing gradients and low computational requirements (as they are simply thresholded linear transforms), ReLU functions have the downside that they can "turn off" if the input falls below 0, leading again to a 0 gradient. This deficiency was resolved by later work in which a "leak" below 0 was introduced:

y = x if x>0, else 0.01x

A further refinement is to make this threshold adaptive with a learnable slope a, the Parameterized Leaky ReLU (PReLU):

y = max(ax, x) if a <= 1
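
In TensorFlow 2 Keras, these variants are available directly as layers (a minimal sketch):

import tensorflow as tf

relu = tf.keras.layers.ReLU()
leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.01)  # fixed "leak" below zero
prelu = tf.keras.layers.PReLU()                     # the slope below zero is learned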

Another trick used by AlexNet is dropout. The idea of dropout is inspired by ensemble methods, in which we average the predictions of many models to obtain more robust results. Clearly, for deep neural networks this is prohibitive; thus, a compromise is to randomly set the values of a subset of neurons to 0 with a probability of 0.5. These values are reset with every forward pass of backpropagation, allowing the network to effectively sample different architectures, since the "dropped out" neurons don't participate in the output in that pass.
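
In Keras, dropout is applied as a layer that is only active during training (a minimal sketch):

import tensorflow as tf

dropout = tf.keras.layers.Dropout(rate=0.5)  # each unit is zeroed with probability 0.5 per forward pass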

Yet another enhancement used in AlexNet is local response normalization. Even though ReLUs don't saturate in the same manner as other units, the authors of the model still found value in constraining the range of output. For example, within an individual kernel map, they normalized the input using the values of adjacent kernels, meaning the overall response was rescaled:
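
A standard form of this normalization (following the notation of the AlexNet paper, with b the normalized output for kernel i) is:

b(i, x, y) = a(i, x, y) / (k + alpha * sum_j a(j, x, y)^2)^beta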

where a is the unnormalized output at a given x, y location on an image, the sum over j runs over adjacent kernels, and beta, k, and alpha are hyperparameters. This rescaling is reminiscent of a later innovation used widely in both convolutional and other neural network architectures, batch normalization. Batch normalization also applies a transformation to "raw" activations within a network:
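
In its standard form (mu and sigma^2 denote the batch mean and variance, and epsilon is a small constant for numerical stability):

BN(x) = gamma * (x - mu) / sqrt(sigma^2 + epsilon) + beta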

 where x is the unnormalized output, and beta and gamma are scale and shift parameters. This transformation is widely applied in many neural network architectures to accelerate training, though the exact reason why it is effective remains a topic of debate.

Now that you have an idea of some of the methodological advances that made training large CNNs possible, let's examine the structure of AlexNet to see some additional architectural components that we will use in the CNNs we implement in generative models in later chapters.