This idea of columns inspired early research into CNN architectures. Instead of learning individual weights between units as in a feedforward network, this architecture uses shared weights within a group of neurons specialized to detect a specific edge in an image. The initial layer of the network (denoted H1) consists of 12 groups of 64 neurons each. Each of these groups is derived by passing a 5 * 5 grid over the 16 * 16-pixel input image; each of the 64 5*5 grids in this group share the same weights, but are tired to different spatial regions of the input. You can see that there must be 64 neurons in each group to cover the input image if their receptive fields overlap by two pixels.
When combined, these 12 groups of neurons in layer H1 form 12 8*8 grids representing the presence or absence of a particular edge within a part of the image - the 8 * 8 grid is effectively a down-sampled version of the image(Figure 3.9). This weight sharing makes intutive sense in that the kernel represented by the weight is specified to detect a distinct color and/or shape, regardless of where it appears in the image. An effect of this down-sampling is a degree of positional invariance; we only know the edge occurred somewhere within a resion of the image, but not the exact location due to the reduced resolution from downsampling. Because they are computed by multiplying a 5*5 matrix(kernel) with a part of the image, an operation used in image blurring and other transformations, these 5*5 input features are known as convlutional kernels, and give the network its name.
Once we have these 128*8 downsampled versions of the image, the next layer(H2) also has 12 groups of neurons; here, the kernels are 5*5*8 - they traverse the surface of an 8*8 map from H1, across 8 of the 12 troups. We need 16 neurons of these 5*5*8 groups since a 5*5 grid can be moved over four times up and down on an 8 * 8 grid to cover all the pixels in the 8 * 8 grid.
Just like deeper cells in the visual cortex, the deeper layers in the network integrate across multiple columns to combine information from different edge detectors together.
Finally, the third hidden layer of this network (H3) contains all-to-all connections between 30 hidden units and the 12 * 16 units in the H2, just as in a traditional feedforward network; a final output of 10 unit classifies the input image as one of 10 hand-drawn digits.
Through weight sharing, the overall number of free parameters in this network is reduced, though it is still large in absolute terms. While backpropagation was used successfully for this task, it required a carefully designed network for a rather limited set of images with a restricted set of outcomes - for real-world applications, such as detection objects from hundreds or thousands of possible categories, other approaches would be necessary.
댓글 없음:
댓글 쓰기