
Monday, February 28, 2022

AlexNet and other CNN innovations

A 2012 article that produced state-of-the-art results classifying the 1.3 million images in ImageNet into 1,000 classes using a model termed AlexNet demonstrates some of the later innovations that made training these kinds of models practical. One, as I've alluded to before, is using the ReLU in place of the sigmoid or hyperbolic tangent functions. A ReLU is a function of the form:

y = max(0,x)

In contrast to the sigmoid function, or tanh, in which the derivative shrinks to 0 as the function saturates, the ReLU function has a constant gradient for positive inputs and a discontinuity in its gradient at 0 (Figure 3.10). This means that the gradient does not saturate, which would otherwise cause deeper layers of the network to train more slowly and lead to intractable optimization.
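To see this difference concretely, here is a minimal NumPy sketch (my own illustration, not code from the paper) that evaluates the sigmoid gradient and the ReLU gradient at a few points; the sigmoid's gradient collapses toward 0 for large inputs, while the ReLU's gradient stays at 1 for any positive input.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # shrinks toward 0 as |x| grows (saturation)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # constant 1 for positive inputs, 0 otherwise

xs = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])
print("sigmoid grad:", sigmoid_grad(xs))  # tiny at +/-10: the vanishing gradient problem
print("relu grad:   ", relu_grad(xs))     # stays 1 for every positive input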

While advantageous due to their non-vanishing gradients and low computational requirements (they are simply thresholded linear transforms), ReLU functions have the downside that they can "turn off" if the input falls below 0, leading again to a 0 gradient. This deficiency was resolved by later work in which a "leak" below 0 was introduced:

y = x if x>0, else 0.01x

A further refinement is to make this threshold adaptive with a learnable slope a, giving the Parameterized Leaky ReLU (PReLU):

y = max(ax, x) if a <= 1
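Both variants are easy to express directly. The following NumPy sketch implements the leaky ReLU with the 0.01 slope above, and the PReLU form with an illustrative value of a (in practice a is a parameter learned during training).

import numpy as np

def leaky_relu(x, slope=0.01):
    # y = x if x > 0, else slope * x
    return np.where(x > 0, x, slope * x)

def prelu(x, a):
    # y = max(a*x, x) for a <= 1
    return np.maximum(a * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))     # negative inputs keep a small, non-zero response
print(prelu(x, a=0.25))  # slope of 0.25 on the negative side (illustrative value)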

Another trick used by AlexNet is dropout. The idea of dropout is inspired by ensemble methods, in which we average the predictions of many models to obtain a more robust result. Clearly, for deep neural networks this is prohibitive; thus a compromise is to randomly set the values of a subset of neurons to 0 with a probability of 0.5. These values are reset with every forward pass of backpropagation, allowing the network to effectively sample different architectures, since the "dropped out" neurons don't participate in the output in that pass.
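A minimal sketch of this idea in NumPy, assuming the drop probability of 0.5 described above, looks like the following; note that modern frameworks typically use "inverted" dropout, rescaling the surviving activations during training so that no adjustment is needed at test time, which is the convention used here.

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                     # no dropout at inference time
    mask = rng.random(activations.shape) >= p  # a fresh random mask every forward pass
    return activations * mask / (1.0 - p)      # rescale the survivors (inverted dropout)

h = np.ones(8)
print(dropout(h))  # roughly half of the units are zeroed on this pass
print(dropout(h))  # a different random subset is dropped on the next pass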

Yet another enhancement used in AlexNet is local response normalization. Even though ReLUs don't saturate in the same manner as other units, the authors of the model still found value in constraining the range of the output. For example, for an individual kernel, they normalized the input using the values of adjacent kernels, meaning the overall response was rescaled:

b^i_(x,y) = a^i_(x,y) / (k + alpha * sum_j (a^j_(x,y))^2)^beta

where a is the unnormalized output at a given x, y location on an image, the sum over j runs over adjacent kernels, and beta, k, and alpha are hyperparameters.
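The following is a rough NumPy sketch of this across-channel normalization; the hyperparameter values (k=2, alpha=1e-4, beta=0.75, n=5 adjacent kernels) are those reported for AlexNet, while the height-width-channel array layout is simply an assumption for illustration.

import numpy as np

def local_response_norm(a, k=2.0, alpha=1e-4, beta=0.75, n=5):
    # a has shape (height, width, channels); each channel is rescaled by the
    # summed squared activations of its n neighboring channels.
    H, W, C = a.shape
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[:, :, lo:hi + 1] ** 2, axis=2)) ** beta
        b[:, :, i] = a[:, :, i] / denom
    return b

activations = np.random.default_rng(1).normal(size=(4, 4, 8))
print(local_response_norm(activations).shape)  # same shape, rescaled per channel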

This rescaling is reminiscent of a later innovation used widely in both convolutional and other neural network architectures, batch normalization. Batch normalization also applies a transformation to "raw" activations within a network:

y = gamma * (x - mean) / sqrt(variance + epsilon) + beta

where x is the unnormalized output, and gamma and beta are scale and shift parameters. This transformation is widely applied in many neural network architectures to accelerate training, though the exact reason why it is effective remains a topic of debate.
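As a sketch, batch normalization over a mini-batch of feature vectors can be written in NumPy as follows; the per-feature batch statistics and the small epsilon constant follow the formula above, and gamma and beta would normally be learned during training rather than fixed as they are here.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                   # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps) # normalized activations
    return gamma * x_hat + beta             # learned scale and shift

batch = np.random.default_rng(2).normal(loc=3.0, scale=2.0, size=(16, 4))
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3))  # approximately 0 per feature
print(out.std(axis=0).round(3))   # approximately 1 per feature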

Now that you have an idea of some of the methodological advances that made training large CNNs possible, let's examine the structure of AlexNet to see some additional architectural components that we will use in the CNNs we implement in generative models in later chapters.
