In Figure 5.1, you can see an example of images from the CIFAR-10 dataset, along with an example of an early VAE algorithm that can generate fuzzy versions of these images based on a random number input:
More recent work on VAE networks has allowed these models to generate much better images, as you will see later in this chapter. To start, let's revisit the problem of generating MNIST digits and how we can extend this approach to more complex data.
Recall from Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, and Chapter 4, Teaching Networks to Generate Digits, that the RBM (or DBN) model in essence involves learning the posterior probability distribution for images (x) given some latent "code" (z), represented by the hidden layer(s) of the network, along with the "marginal likelihood" of x:
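In standard notation (the exact symbols in the original equation may differ), the marginal likelihood of an image integrates the conditional likelihood over all possible codes:

$$p(x) = \int p(x \mid z)\, p(z)\, dz$$

Bayes' rule then gives the posterior over codes as $p(z \mid x) = p(x \mid z)\,p(z)/p(x)$.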
We can see z as being an "encoding" of the image x (for example, the activations of the binary hidden units in the RBM), which can be decoded (for example, by running the RBM in reverse to sample an image) to get a reconstruction of x. If the encoding is "good," the reconstruction will be close to the original image. Because these networks encode and decode representations of their input data, they are also known as "autoencoders."
The ability of deep neural networks to capture the underlying structure of complex data is one of their most attractive features; as we saw with the DBN model in Chapter 4, Teaching Networks to Generate Digits, it allows us to improve the performance of a classifier by creating a better underlying model for the distribution of the data. It can also be used simply to "compress" the complexity of data, in a similar way to principal component analysis (PCA) in classical statistics. In Figure 5.2, you can see how the stacked RBM model can be used as a way to encode the distribution of faces, for example.
We start with a "pre-training" phase to create a 30-unit encoding vector, which we then calibrate by forcing it to reconstruct the input image, before fine-tuning with standard backpropagation:
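In the original paper this architecture is pre-trained layer by layer with stacked RBMs before fine-tuning; as a minimal sketch of the encode/decode structure (skipping the RBM pre-training phase, and with layer sizes assumed in the spirit of the paper's 784-1000-500-250-30 stack), a TensorFlow 2 version might look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A deep autoencoder with a 30-unit code, trained end to end.
# Note: Hinton & Salakhutdinov (2006) initialize these weights with
# stacked RBM pre-training; this sketch relies on backpropagation alone.
encoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(1000, activation="relu"),
    layers.Dense(500, activation="relu"),
    layers.Dense(250, activation="relu"),
    layers.Dense(30, activation="linear"),  # the 30-unit encoding vector
])
decoder = models.Sequential([
    layers.Input(shape=(30,)),
    layers.Dense(250, activation="relu"),
    layers.Dense(500, activation="relu"),
    layers.Dense(1000, activation="relu"),
    layers.Dense(784, activation="sigmoid"),  # reconstruct pixel intensities
])
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# The network is "calibrated" by asking it to reproduce its own input:
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
autoencoder.fit(x_train, x_train, epochs=5, batch_size=128)
```

Training end to end works for a network of this depth on modern hardware; the pre-training phase described in the paper was needed to reach good reconstructions before such training was routine.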
As an example of how the stacked RBM model can more effectively represent the distribution of images, the authors of the paper Reducing the Dimensionality of Data with Neural Networks, from which Figure 5.2 is derived, demonstrated this by encoding the MNIST digits with a two-unit code and comparing the result to PCA:
On the left, we see the digits 0-9 (represented by different shades and shapes) encoded using two-dimensional PCA. Recall that PCA is generated using a low-dimensional factorization of the covariance matrix of the data:

$$\mathrm{Cov}(X) \approx UV$$
where Cov(X) has the same height/width M as the data (for example, M = 784 for the 28 × 28 pixels in MNIST) and U and V are both lower-dimensional (M × k and k × M), where k is much smaller than M. Because they have a smaller number of rows/columns (k) than the original data in one dimension, U and V are lower-dimensional representations of the data, and we can get an encoding of an individual image by projecting it onto these k vectors, giving a k-unit encoding of the data. Since the decomposition (and projection) is a linear transformation (multiplying two matrices), the ability of the PCA components to distinguish data well depends on the data being linearly separable (we can draw a hyperplane through the space between groups; that space could be two-dimensional or N-dimensional, like the 784 pixels in the MNIST images).
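To make the projection concrete, here is a minimal sketch of the two-component PCA encoding shown on the left of Figure 5.3 (the use of scikit-learn here is my choice for illustration):

```python
from sklearn.decomposition import PCA
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()
X = x_train.reshape(-1, 784).astype("float32") / 255.0

# Fit a two-component PCA: each 784-pixel image is projected onto the
# top two eigenvectors of the data's covariance matrix, giving a 2-unit code.
pca = PCA(n_components=2)
codes = pca.fit_transform(X)
print(codes.shape)  # (60000, 2): one 2D encoding per image

# Plotting `codes` colored by `y_train` reproduces the kind of
# overlapping clusters seen on the left of Figure 5.3.
```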
As you can see in Figure 5.3, PCA yields overlapping codes for the images, showing that it is challenging to represent digits using a two-component linear decomposition in which vectors representing the same digit are close together, while those representing different digits are clearly separated. Conceptually, the neural network is able to capture more of the variation between images representing different digits than PCA, as shown by its ability to separate the representations of these digits more clearly in a two-dimensional space.
As an analogy to understand this phenomenon, consider a very simple two-dimensional dataset consisting of parallel hyperbolas (curves defined by second-degree polynomial equations), as shown in Figure 5.4:
At the top, even though we have two distinct classes, we cannot draw a straight line through two-dimensional space to separate the two groups; in a neural network, the weight matrix in a single layer before the nonlinear transformation of a sigmoid or tanh is, in essence, a linear boundary of this kind. However, if we apply a nonlinear transformation to our 2D coordinates, such as taking the square root of the hyperbolas, we can create two separable planes (Figure 5.4, bottom).
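As a minimal numerical sketch of this idea (the constants and the particular feature map, squaring the coordinates, are chosen here for illustration; Figure 5.4 may use a different transform):

```python
import numpy as np

# Two "parallel" hyperbolas y = sqrt(x^2 + c) with different offsets c.
x = np.linspace(-3, 3, 200)
class_a = np.column_stack([x, np.sqrt(x**2 + 1.0)])  # c = 1
class_b = np.column_stack([x, np.sqrt(x**2 + 4.0)])  # c = 4

# In the original (x, y) coordinates, no straight line separates the classes.
# After the nonlinear feature map (x, y) -> (x^2, y^2), the quantity
# y^2 - x^2 is constant per class (1 vs. 4), so a line separates them exactly.
z_a, z_b = class_a**2, class_b**2
d_a = (z_a[:, 1] - z_a[:, 0]).round(6)
d_b = (z_b[:, 1] - z_b[:, 0]).round(6)
print(d_a.min(), d_a.max())  # 1.0 1.0
print(d_b.min(), d_b.max())  # 4.0 4.0
```

The point of the neural network is that it learns such a feature map from the data, rather than requiring us to hand-pick the transform.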
A similar phenomenon is at play with our MNIST data: we need a neural network in order to place these 784-pixel images into distinct, separable regions of space.
This goal is achieved by performing a nonlinear transformation on the original, overlapping data, with an objective function that rewards increasing the spatial separation among vectors encoding images of different digits. A separable representation thus increases the ability of the neural network to differentiate image classes using these representations. Thus, in Figure 5.3, we can see on the right that applying the DBN model creates the required nonlinear transformation to separate the different images.
Now that we've covered how neural networks can compress data into numerical vectors and what some desirable properties of those vector representations are, we'll examine how to optimally compress information in these vectors. To do so, each element of the vector should encode distinct information from the others, a property we can achieve using a variational objective. This variational objective is the building block for creating VAE networks.