While the architecture of AlexNet shown in Figure 3.12 might look intimidating, it is not so difficult to understand once we break this large model up into individual processing steps. Let's start with the input images and trace how the output classification is computed for each image through the series of transformations performed by each subsequent layer of the neural network.
The input images to AlexNet are of size 224 * 224 * 3 (for the RGB channels). The first layer applies 96 kernels of size 11 * 11 * 3; the output is response normalized (as described previously) and max pooled. Max pooling is an operation that takes the maximum value over an n * n grid to register whether a pattern appeared anywhere in that region of the input; this is again a form of positional invariance.
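To make the max pooling step concrete, here is a minimal NumPy sketch. It is not the AlexNet implementation: the function name `max_pool2d`, the 2 * 2 window, and the toy 4 * 4 inputs are all assumptions chosen for illustration. It shows that a pattern landing at different positions inside the same pooling window produces the same pooled output.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2-D array with size x size windows (illustrative helper)."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Take the largest activation inside the current window.
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

# Two toy feature maps: the same single activation, shifted by one pixel
# but still inside the top-left 2 x 2 window.
a = np.zeros((4, 4))
a[0, 0] = 1.0
b = np.zeros((4, 4))
b[1, 1] = 1.0

pooled_a = max_pool2d(a)
pooled_b = max_pool2d(b)

# Both inputs pool to the same 2 x 2 map, so the small shift is invisible
# to later layers.
same = np.array_equal(pooled_a, pooled_b)
print(pooled_a)
print(same)
```

The invariance only holds for shifts that stay within one window; a pattern that crosses a window boundary moves to a neighboring cell of the pooled map.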
The second layer is also a set of kernels, in this case 256 kernels of size 5 * 5 * 48. The third through fifth hidden layers apply additional convolutions, without normalization, and are followed by two fully connected layers and an output of size 1,000 representing the possible image classes in ImageNet. The authors of AlexNet used several GPUs to train the model, and this acceleration was important in making it practical to train a network of this size.
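One way to trace how an image shrinks as it passes through the first layers is to apply the standard output-size formula for a sliding window, floor((n + 2p - k) / s) + 1. The sketch below is only an approximation of the network described here: the input size of 224 and the 11 * 11 and 5 * 5 kernels follow the AlexNet paper, but the strides, padding values, and 3 * 3 pooling windows are assumptions not stated in the text above (the padding in particular follows a common reimplementation convention), and the helper name `output_size` is hypothetical.

```python
def output_size(n, kernel, stride=1, padding=0):
    """Spatial size after sliding a kernel-wide window over n pixels."""
    return (n + 2 * padding - kernel) // stride + 1

# Assumed hyperparameters (not given in the text): stride 4 and padding 2 for
# the first convolution, 3 x 3 pooling with stride 2, padding 2 for the second.
image = 224
conv1 = output_size(image, kernel=11, stride=4, padding=2)  # first convolution
pool1 = output_size(conv1, kernel=3, stride=2)              # first max pooling
conv2 = output_size(pool1, kernel=5, stride=1, padding=2)   # second convolution
pool2 = output_size(conv2, kernel=3, stride=2)              # second max pooling

for name, size in [("conv1", conv1), ("pool1", pool1),
                   ("conv2", conv2), ("pool2", pool2)]:
    print(f"{name}: {size} x {size}")
```

Under these assumptions the feature maps shrink from 224 to 55, 27, 27, and then 13 pixels on a side, which is why the later convolutions are far cheaper per kernel than the first.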
Looking at the features learned during training in the initial 11 * 11 * 3 convolutions (Figure 3.13), we can see recognizable edges and colors. While the authors of AlexNet don't show examples of neurons higher in the network that synthesize these basic features, an illustration is provided by another study in which researchers trained a large CNN to classify images in YouTube videos, yielding a neuron in the upper reaches of the network that appeared to be a cat detector.
This overview should give you an idea of why CNN architectures look the way they do, and what developments have allowed them to become more tractable as the basis for image classifiers or image-based generative models over time. We will now turn to a second class of more specialized architectures, RNNs, that are used to develop time- or sequence-based models.