
Monday, February 28, 2022

Networks for sequence data

 In addition to image data, natural language text has also been a frequent topic of interest in neural network research. However, unlike the datasets we've examined thus far, language has a distinct order that is important to its meaning. Thus, to accurately capture the patterns in language or time-dependent data, it is necessary to utilize networks designed for this purpose.

AlexNet architecture

 While the architecture of AlexNet shown in Figure 3.12 might look intimidating, it is not so difficult to understand once we break up this large model into individual processing steps. Let's start with the input images and trace how the output classification is computed for each image through a series of transformations performed by each subsequent layer of the neural network.

The input images to AlexNet are of size 224 * 224 * 3 (for the RGB channels). The first layer consists of groups of 96 units and 11 * 11 * 3 kernels; the output is response normalized (as described previously) and max pooled. Max pooling is an operation that takes the maximum value over an n * n grid to register whether a pattern appeared anywhere in the input; this is again a form of positional invariance.
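As a quick illustration (a minimal numpy sketch of non-overlapping 2 * 2 pooling, not AlexNet's exact overlapping 3 * 3 pooling):

import numpy as np

def max_pool(feature_map, n=2):
    # Take the maximum over each n * n block of the feature map.
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % n, :w - w % n]    # trim so the blocks divide evenly
    blocks = trimmed.reshape(h // n, n, w // n, n)   # split into n * n blocks
    return blocks.max(axis=(1, 3))                   # maximum within each block

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [2, 0, 7, 8]])
print(max_pool(x))   # [[4 1]
                     #  [2 8]]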

The second layer is also a set of convolutional kernels, of size 5 * 5 * 48, in groups of 256. The third through fifth hidden layers add further convolutions, without normalization, followed by two fully connected layers and an output of size 1,000 representing the possible image classes in ImageNet. The authors of AlexNet used several GPUs to train the model, and this acceleration was important to training a model of this size in a practical amount of time.
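To make the layer progression concrete, here is a rough Keras sketch of an AlexNet-style stack (the two-GPU split, response normalization, and the exact padding/stride details of the original are omitted, so the shapes are only approximate):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(96, kernel_size=11, strides=4, activation="relu"),     # first convolutional layer
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Conv2D(256, kernel_size=5, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Conv2D(384, kernel_size=3, padding="same", activation="relu"),
    layers.Conv2D(384, kernel_size=3, padding="same", activation="relu"),
    layers.Conv2D(256, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1000, activation="softmax"),                            # 1,000 ImageNet classes
])
model.summary()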

Looking at the features learned during training in the initial 11 * 11 * 3 convolutions (Figure 3.13), we can see recognizable edges and colors. While the authors of AlexNet don't show examples of neurons higher in the network that synthesize these basic features, an illustration is provided by another study in which researchers trained a large CNN to classify images in YouTube videos, yielding a neuron in the upper reaches of the network that appeared to be a cat detector.

This overview should give you an idea of why CNN architectures look the way they do, and what developments have allowed them to become more tractable as the basis for image classifiers or image-based generative models over time. We will now turn to a second class of more specialized architectures - RNNs - that are used to develop time- or sequence-based models.


AlexNet and other CNN innovations

 A 2012 article that produced state-of-the-art results classifying the 1.3 million images in ImageNet into 1,000 classes using a model termed AlexNet demonstrates some of the later innovations that made training these kinds of models practical. One, as I've alluded to before, is using ReLU in place of sigmoid or hyperbolic tangent functions. A ReLU is a function of the form:

y = max(0,x)

In contrast to the sigmoid function, or tanh, in which the derivative shrinks to 0 as the function saturates, the ReLU function has a constant gradient for positive inputs and a discontinuity at 0 (Figure 3.10). This means that the gradient does not saturate, which would otherwise cause deeper layers of the network to train more slowly and lead to intractable optimization.

While advantageous due to non-vanishing gradients and their low computational requirements (as they are simply thresholded linear transforms), ReLU functions have the downside that they can "turn off" if the input falls below 0, leading again to a 0 gradient. This deficiency was resolved by later work in which a "leak" below 0 was introduced:

y = x if x>0, else 0.01x

A further refinement is to make this threshold adaptive with a learnable slope a, the Parameterized Leaky ReLU (PReLU):

y = max(ax, x) if a <= 1
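A small numpy sketch of these three activations side by side (the slope values here are just illustrative):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, leak=0.01):
    # Pass a small fraction of negative inputs through instead of zeroing them.
    return np.where(x > 0, x, leak * x)

def prelu(x, a):
    # Same idea, but the slope a (<= 1) would be learned during training.
    return np.maximum(a * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))           # [0.  0.  0.  1.5]
print(leaky_relu(x))     # [-0.02  -0.005  0.  1.5]
print(prelu(x, a=0.25))  # [-0.5  -0.125  0.  1.5]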

Another trick used by AlexNet is dropout. The idea of dropout is inspired by ensemble methods in which we average the predictions of many models to obtain a more robust result. Clearly, for deep neural networks, training many separate models is prohibitively expensive; thus, a compromise is to randomly set the values of a subset of neurons to 0 with a probability of 0.5. The dropped-out set is re-sampled with every forward pass of backpropagation, allowing the network to effectively sample different architectures, since the "dropped out" neurons don't participate in the output in that pass.
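A minimal numpy sketch of the idea (this is the common "inverted dropout" formulation, which rescales the surviving activations so that nothing needs to change at test time):

import numpy as np

def dropout(activations, rate=0.5, training=True):
    if not training:
        return activations                      # at test time, use all neurons
    keep = (np.random.rand(*activations.shape) >= rate).astype(activations.dtype)
    # Zero out roughly half the neurons and rescale the rest to preserve the expected value.
    return activations * keep / (1.0 - rate)

h = np.array([0.2, 1.3, -0.7, 0.9, 2.1])
print(dropout(h))   # a different random subset is zeroed on every forward pass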

Yet another enhancement used in AlexNet is local response normalization. Even though ReLUs don't saturate in the same manner as other units, the authors of the model still found value in constraining the range of the output. For an individual kernel, they normalized the value at each location using the values of adjacent kernels, meaning the overall response was rescaled:

b = a / (k + alpha * sum_j(a_j^2))^beta

where a is the unnormalized output at a given x, y location on an image, b is the normalized response, the sum over j runs over adjacent kernels, and beta, k, and alpha are hyperparameters.
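A minimal numpy sketch of this normalization across adjacent kernel maps (the hyperparameter values are the ones reported in the AlexNet paper):

import numpy as np

def local_response_norm(a, k=2.0, alpha=1e-4, beta=0.75, n=5):
    # a has shape (height, width, channels); each channel i is divided by a term
    # summing the squared activations of the n adjacent channels centered on i.
    h, w, c = a.shape
    b = np.empty_like(a)
    for i in range(c):
        lo, hi = max(0, i - n // 2), min(c, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[:, :, lo:hi] ** 2, axis=-1)) ** beta
        b[:, :, i] = a[:, :, i] / denom
    return b

a = np.random.rand(4, 4, 8).astype(np.float32)
print(local_response_norm(a).shape)   # (4, 4, 8)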

 This rescaling is reminiscent of a later innovation used widely in both convolutional and other neural network architectures: batch normalization, which also applies a transformation on "raw" activations within a network:

y = gamma * (x - mean) / sqrt(variance + epsilon) + beta

where x is the unnormalized output, and gamma and beta are scale and shift parameters. This transformation is widely applied in many neural network architectures to accelerate training, though the exact reason why it is effective remains a topic of debate.
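A minimal numpy sketch of the training-time computation (the running statistics used at inference time are omitted):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                        # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta                # learnable rescale (gamma) and shift (beta)

x = np.random.randn(32, 10)                    # a batch of 32 examples with 10 features
out = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(2), out.std(axis=0).round(2))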

Now that you have an idea of some of the methodological advances that made training large CNNs possible, let's examine the structure of AlexNet to see some additional architectural components that we will use in the CNNs we implement for generative models in later chapters.

Early CNNs

 This idea of columns inspired early research into CNN architectures. Instead of learning individual weights between units as in a feedforward network, this architecture uses shared weights within a group of neurons specialized to detect a specific edge in an image. The initial layer of the network (denoted H1) consists of 12 groups of 64 neurons each. Each of these groups is derived by passing a 5 * 5 grid over the 16 * 16-pixel input image; each of the 64 5 * 5 grids in a group shares the same weights, but is tied to a different spatial region of the input. You can see that there must be 64 neurons in each group to cover the input image if their receptive fields overlap by two pixels.

When combined, these 12 groups of neurons in layer H1 form 12 8 * 8 grids representing the presence or absence of a particular edge within a part of the image - each 8 * 8 grid is effectively a down-sampled version of the image (Figure 3.9). This weight sharing makes intuitive sense in that the kernel represented by the weights is specialized to detect a distinct color and/or shape, regardless of where it appears in the image. An effect of this down-sampling is a degree of positional invariance; we only know the edge occurred somewhere within a region of the image, but not its exact location, due to the reduced resolution. Because they are computed by multiplying a 5 * 5 matrix (kernel) with a part of the image, an operation also used in image blurring and other transformations, these 5 * 5 input features are known as convolutional kernels, and give the network its name.

Once we have these 12 8 * 8 downsampled versions of the image, the next layer (H2) also has 12 groups of neurons; here, the kernels are 5 * 5 * 8 - they traverse the surface of an 8 * 8 map from H1, across 8 of the 12 groups. We need 16 neurons in each of these 5 * 5 * 8 groups, since a 5 * 5 grid can be moved four positions across and four positions down an 8 * 8 grid to cover all of its pixels.
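The neuron counts follow from the usual sliding-window arithmetic; a quick check:

def n_positions(n, k, stride=1):
    # Number of positions a k-wide window can occupy along an n-wide grid.
    return (n - k) // stride + 1

# H2: a 5 * 5 kernel slid over an 8 * 8 map fits in 4 positions across and 4 down,
# so each H2 group needs 4 * 4 = 16 neurons.
p = n_positions(8, 5)
print(p, p * p)   # 4 16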

Just like deeper cells in the visual cortex, the deeper layers in the network integrate across multiple columns to combine information from different edge detectors together.

Finally, the third hidden layer of this network (H3) contains all-to-all connections between 30 hidden units and the 12 * 16 units in H2, just as in a traditional feedforward network; a final output layer of 10 units classifies the input image as one of 10 hand-drawn digits.
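Putting H1, H2, H3, and the output together, a rough Keras approximation of this network might look as follows (the original wired specific groups of H1 maps to each H2 group by hand, so standard convolutions here only approximate the shapes):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(16, 16, 1)),
    layers.Conv2D(12, kernel_size=5, strides=2, padding="same", activation="tanh"),  # H1: 12 maps of 8 * 8
    layers.Conv2D(12, kernel_size=5, strides=2, padding="same", activation="tanh"),  # H2: 12 maps of 4 * 4 (12 * 16 units)
    layers.Flatten(),
    layers.Dense(30, activation="tanh"),    # H3: 30 fully connected hidden units
    layers.Dense(10, activation="softmax")  # one of 10 hand-drawn digits
])
model.summary()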

Through weight sharing, the overall number of free parameters in this network is reduced, though it is still large in absolute terms. While backpropagation was used successfully for this task, it required a carefully designed network for a rather limited set of images with a restricted set of outcomes - for real-world applications, such as detecting objects from hundreds or thousands of possible categories, other approaches would be necessary.

Saturday, February 26, 2022

Networks for seeing: Convolutional architectures

 As noted at the beginning of this chapter, one of the inspirations for deep neural network models is the biological nervous system. As researchers attempted to design computer vision systems that would mimic the functioning of the visual system, they turned to the architecture of the retina, as revealed by physiological studies performed by the neurobiologists David Hubel and Torsten Wiesel in the 1960s. As previously described, the physiologist Santiago Ramón y Cajal provided visual evidence that neural structures such as the retina are arranged in vertical networks:

Hubel and Wiesel studied the retinal system in cats, showing how their perception of shapes is composed of the activity of individual cells arranged in a column. Each column of cells is designed to detect a specific orientation of an edge in an input image; images of complex shapes are stitched together from these simpler images.

Varieties of networks: Convolution and recursive

 Up until now we've primarily discussed the basics of neural networks by referencing feedforward networks, where every input is connected to every output in each layer. 

While these feedforward networks are useful for illustrating how deep networks are trained, they are only one class of a broader set of architectures used in modern applications, including generative models. Thus, before covering some of the techniques that make training large networks practical, let's review these alternative deep models.


The shortfalls of backpropagation

 While the backpropagation procedure provides a way to update interior weights within the network in a principled way, it has several shortcomings that make deep networks difficult to use in practice. One is the problem of vanishing gradients. In our derivation of the backpropagation formulas, you saw that gradients for weights deeper in the network are products of successive partial derivatives from higher layers. In our example, we used the sigmoid function; if we plot the value of the sigmoid and its first derivative, we can see a potential problem:


As the value of the sigmoid function increases or decreases towards the extremes (0 or 1, representing being either "off" or "on"), the values of the gradient vanish to near zero. This means that the updates to w and b, which are products of these gradients from the hidden activation functions y, shrink towards zero, so the weights change little between iterations and the parameters of the hidden-layer neurons change very slowly during backpropagation. Clearly, one problem here is that the sigmoid function saturates; thus, choosing another nonlinearity might circumvent this problem (this is indeed one of the solutions that was proposed, the ReLU, as we'll cover later).
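A quick numerical illustration of this saturation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The derivative sigmoid(x) * (1 - sigmoid(x)) shrinks toward zero as the unit saturates.
for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x = {x:5.1f}   sigmoid = {s:.5f}   derivative = {s * (1 - s):.7f}")
# At x = 10 the derivative is about 0.0000454; a product of several such terms across
# layers drives the weight updates for the deeper layers toward zero.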

 Another problem is more subtle, and has to do with how the network utilizes its available free parameters. As you saw in Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, a posterior probability of a variable can be computed as a product of a likelihood and a prior distribution. We can see deep neural networks as a graphical representation of this kind of probability: the output of a neuron, depending upon its parameters, is a product of all the input values and the distributions on those inputs (the priors). A problem occurs when those values become tightly coupled. As an illustration, consider the competing hypotheses for a headache:


If a patient has cancer, the evidence is so overwhelming that whether they have a cold or not provides no additional value; in essence, the values of the two prior hypotheses become coupled because of the influence of one. This makes it intractable to compute the relative contribution of different parameters, particularly in a deep network; we will cover this problem in our discussion of Restricted Boltzmann Machines and Deep Belief Networks in Chapter 4, Teaching Networks to Generate Digits. As we will describe in more detail in that chapter, a 2006 study showed how to counteract this effect, and was one of the first demonstrations of tractable inference in deep neural networks, a breakthrough that relied upon a generative model that produced images of hand-drawn digits.

Beyond these concerns, other challenges to the more widespread adoption of neural networks in the 1990s and early 2000s were the availability of methods such as Support Vector Machines, Gradient and Stochastic Gradient Boosting Models, Random Forests, and even penalized regression methods such as LASSO and Elastic Net, for classification and regression tasks.

While, in theory, deep neural networks had potentially greater representational power than these models, since they build hierarchical representations of the input data through successive layers in contrast to the "shallow" representation given by a single transformation such as a regression weight or decision tree, in practice the challenges of training deep networks made these "shallow" methods more attractive for practical applications. This was coupled with the fact that larger networks required tuning thousands or even millions of parameters, requiring large-scale matrix calculations that were infeasible before the explosion of cheap compute resources available from cloud vendors - including GPUs and TPUs especially suited to rapid matrix calculations - made these experiments practical.

Now that we've covered the basics of training simple network architectures, let's turn to more complex models that will form the building blocks of many of the generative models in the rest of the book: CNNs and sequence models (RNNs, LSTMs, and others).