
Thursday, March 17, 2022

Importing CIFAR

 Now that we've discussed the underlying theory of VAE algorithms, let's start building a practical example using a real-world dataset. As we discussed in the introduction, for the experiments in this chapter we'll be working with the Canadian Institute for Advanced Research 10-class (CIFAR-10) dataset. The images in this dataset are part of a larger 80 million "tiny images" dataset, most of which do not have class labels as CIFAR-10 does; the CIFAR-10 labels were initially created by student volunteers, and the larger tiny images dataset allows researchers to submit labels for parts of the data.

Like the MNIST dataset, CIFAR-10 can be downloaded using the TensorFlow Datasets API:

import tensorflow.compat.v2 as tf

import tensorflow_datasets as tfds

cifar10_builder = tfds.builder("cifar10")

cifar10_builder.download_and_prepare()

This will download the dataset to disk and make it available for our experiments. To split it into training and test sets, we can use the following commands:

cifar10_train = cifar10_builder.as_dataset(split="train")

cifar10_test = cifar10_builder.as_dataset(split="test")

Let's inspect one of the images to see what format it is in:

cifar10_train.take(1)

The output tells us that each element of the dataset has the format <DatasetV1Adapter shapes: {image: (32,32,3), label: ()}, types: {image: tf.uint8, label: tf.int64}>. Unlike the MNIST dataset we used in Chapter 4, Teaching Networks to Generate Digits, the CIFAR images have three color channels, each with 32 x 32 pixels, while the label is an integer from 0 to 9 (representing one of the 10 classes). We can also plot the images to inspect them visually:

from PIL import Image

import numpy as np

import matplotlib.pyplot as plt

for sample in cifar10_train.map(lambda x: flatten_image(x, label=True)).take(1):

    plt.imshow(sample[0].numpy().reshape(32,32,3).astype(np.float32), cmap=plt.get_cmap("gray"))

    print("Label:L %d" % sample[1].numpy())

This gives the following output:

Like the RBM model, the VAE model we'll build in this example has an output scaled between 0 and 1 and accepts flattened versions of the images, so we'll need to turn each image into a vector and scale it to a maximum of 1:

def flatten_image(x, label=False):

    if label:

        return (tf.divide(tf.dtypes.cast(tf.reshape(x["image"], (1, 32*32*3)), tf.float32), 256.0), x["label"])

    else:

        return ( tf.divide(tf.dtypes.cast(tf.reshape(x["image"], (1, 32*32*3)), tf.float32), 256.0))

This results in each image being a vector of length 3,072 (32 * 32 * 3), which we can reshape once we've run the model to examine the generated images.
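For instance, here is a minimal sketch (not from the book's code; the generated_vector name and the random values are hypothetical stand-ins for model output) of how such a flattened vector could be reshaped back into an image for display:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical flattened model output: 3,072 values scaled between 0 and 1
generated_vector = np.random.uniform(size=(1, 32 * 32 * 3)).astype(np.float32)

# Undo the flattening to recover a 32 x 32 image with 3 color channels
generated_image = generated_vector.reshape(32, 32, 3)

plt.imshow(generated_image)
plt.show()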

Inverse Autoregressive Flow

 In our discussion earlier, we noted that we want to use q(z|x) to approximate the "true" p(z|x) that would allow us to generate an ideal encoding of the data, and thus sample from it to generate new images. So far, we've assumed that q(z|x) has a relatively simple distribution, such as a vector of independent Gaussian random variables (a diagonal covariance matrix with 0s on the off-diagonal elements). This sort of distribution has many benefits; because it is simple, we have an easy way to generate new samples by drawing from random normal distributions, and because it is independent, we can separately tune each element of the latent vector z to influence parts of the output image.

However, such a simple distribution may not fit the desired output distribution of the data well, increasing the KL divergence between p(z|x) and q(z|x). Is there a way we can keep the desirable properties of q(z|x) but "transform" z so that it captures more of the complexity needed to represent x?

One approach is to apply a series of autoregressive transformations to z to turn it from a simple into a complex distribution; by "autoregressive," we mean that each transformation utilizes both data from the previous transformation and the current data to compute an updated version of z. In contrast, the basic form of VAE that we introduced above has only a single "transformation" from z to the output (though z might pass through multiple layers, there is no recursive network link to further refine that output). We've seen such transformations before, such as the LSTM networks in Chapter 3, Building Blocks of Deep Neural Networks, where the output of the network is a combination of the current input and a weighted version of the prior time step.

An attractive property of the independent q(z|x) distributions we discussed earlier, such as independent normals, is that they have a very tractable expression for the log-likelihood. This property is important for the VAE model because its objective function depends on integrating over the whole likelihood function, which would be cumbersome for more complex log-likelihood functions. However, by constraining a transformed z to be computed through a series of autoregressive transformations, we get the nice property that the log-likelihood of step t only depends on step t-1; thus the Jacobian (the matrix of partial derivatives of step t with respect to step t-1) is lower triangular, and its log determinant can be computed as a sum of the logs of its diagonal elements:

log|det(dz_t/dz_{t-1})| = sum_i log(dz_{t,i}/dz_{t-1,i})

What kinds of transformations f could be used? Recall that, after the reparameterization trick, z is a function of a noise element ε and the mean and standard deviation output by the encoder Q:

z = μ + σ ⊙ ε, where ε is drawn from a standard normal

If we apply successive layers of this transformation, step t becomes the sum of a new mean u_t and the element-wise product of the prior layer's z and a sigmoidal output σ_t:

z_t = u_t + σ_t ⊙ z_{t-1}

In practice, we use a neural network transformation to stabilize the estimate of the mean at each step, gating between the previous z and a new mean m_t:

z_t = σ_t ⊙ z_{t-1} + (1 - σ_t) ⊙ m_t

Again, note the similarity of this transformation to the LSTM networks discussed in Chapter 3, Building Blocks of Deep Neural Networks. In Figure 5.8, there is another output (h) from the encoder Q, in addition to the mean and standard deviation used to sample z. h is, in essence, "accessory data" that is passed into each successive transformation and, along with the weighted sum being calculated at each step, represents the "persistent memory" of the network in a way reminiscent of the LSTM.
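To make the mechanics concrete, here is a minimal sketch of one such transformation step (an illustrative helper, not the book's implementation), assuming the candidate means m and gating inputs s come from an autoregressive network conditioned on the previous z and the accessory output h:

import tensorflow as tf

def iaf_step(z_prev, m, s):
    """One numerically stable IAF update, gating between the previous z and a new mean m."""
    sigma = tf.sigmoid(s)                          # element-wise gates in (0, 1)
    z_new = sigma * z_prev + (1.0 - sigma) * m     # autoregressive refinement of z
    # The Jacobian of this step is lower triangular, so its log-determinant
    # reduces to a sum over the log of the diagonal (gating) terms.
    log_det = tf.reduce_sum(tf.math.log(sigma + 1e-7), axis=-1)
    return z_new, log_det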



Sunday, March 13, 2022

The reparameterization trick

 In order to allow us to backpropagate through our autoencoder, we need to turn the stochastic sampling of z into a deterministic, differentiable transformation. We can do this by reparameterizing z as a function of a noise variable ε:

z = g(ε, x) = μ(x) + σ(x) ⊙ ε, where ε is drawn from a standard normal

Once we have sampled ε, the randomness in z no longer depends on the parameters of the variational distribution Q (the encoder), and we can backpropagate end to end. Our network now looks like Figure 5.7, and we can optimize our objective using random samples of ε (for example, from a standard normal distribution).

This reparameterization moves the "random" node out of the encoder/decoder framework so we can backpropagate through the whole system, but it also has a subtler advantage: it reduces the variance of these gradients. Note that in the un-reparameterized network, the distribution of z depends on the parameters of the encoder distribution Q; thus, as we change the parameters of Q, we also change the distribution of z, and we would potentially need to use a large number of samples to get a decent estimate.

By reparameterizing, z now depends only on our simpler function, g, with randomness introduced through sampling ε from a standard normal (which doesn't depend on Q); hence, we've removed a somewhat circular dependency and made the gradients we are estimating more stable.
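As a minimal sketch (illustrative rather than the book's exact code), the trick amounts to a small deterministic function of the encoder outputs, with all of the randomness coming from the standard normal sample:

import tensorflow as tf

def reparameterize(mu, log_var):
    # epsilon is drawn from a standard normal that does not depend on the encoder's
    # parameters, so gradients can flow back through mu and log_var.
    eps = tf.random.normal(shape=tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps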

Now that you have seen how the VAE network is constructed, let's discuss a further refinement of this algorithm that allows VAEs to sample from complex distributions: Inverse Autoregressive Flow (IAF).

The variational objective

 We previously covered several examples of how images can be compressed into numerical vectors using neural networks. This section will introduce the elements that allow us to create effective encodings to sample new images from a space of random numerical vectors: principally, efficient inference algorithms and appropriate objective functions. Let's start by quantifying more rigorously what makes such an encoding "good" and allows us to recreate images well. We will need to maximize the posterior:

p(z|x) = p(x|z)p(z)/p(x)

A problem occurs because the space of x is extremely high dimensional, which, as you saw, can occur in even simple data such as binary MNIST digits, where we have 2^(number of pixels) possible configurations that we would need to integrate over (in the mathematical sense of integrating over a probability distribution) to get a measure of the probability of an individual image; in other words, the density p(x) is intractable, making the posterior p(z|x), which depends on p(x), likewise intractable.
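As a quick back-of-the-envelope check (not from the book), even the binary MNIST case is astronomically large:

# Number of possible configurations of a binary 28 x 28 image
configs = 2 ** (28 * 28)
print(len(str(configs)))  # 237 digits, i.e. roughly 10^236 configurations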

In some cases, as you saw in Chapter 4, Teaching Networks to Generate Digits, we can use simple cases such as binary units to compute an approximation such as contrastive divergence, which allows us to still compute a gradient even if we can't calculate a closed form. However, this might also be challenging for very large datasets, where we would need to make many passes over the data to compute an average gradient using Contrastive Divergence (CD), as you saw previously in Chapter 4, Teaching Networks to Generate Digits.

If we can't calculate the distribution of our encoder p(z|x) directly, maybe we could optimize an approximation that is "close enough" - let's call this q(z|x). Then, we could use a measure to determine if the distributions are close enough. One useful measure of closeness is whether the two distributions encode similar information; we can quantify information using the Shannon information equation:

I(p(x)) = -log(p(x))

Consider why this is a good measure: as p(x) decreases, an event becomes rarer, and thus observing the event communicates more information about the system or dataset, leading to a large positive value of -log(p(x)). Conversely, as the probability of an event nears 1, that event encodes less information about the dataset, and the value of -log(p(x)) approaches 0.

Thus, if we wanted to measure the difference between the information encoded in two distributions, p and q, we could use the difference in their information:

I(q(x)) - I(p(x)) = -log(q(x)) + log(p(x)) = log(p(x)/q(x))

Finally, if we want to find the expected difference in information between the distributions over all elements of x, we can take the average with respect to p(x):

KL(p(x) || q(x)) = ∫ p(x) log(p(x)/q(x)) dx

This quantity is known as the Kullback-Leibler (KL) divergence. It has a few interesting properties:

1. It is not symmetric: KL(p(x), q(x)) does not, in general, equal KL(q(x), p(x)), so the "closeness" is measured by mapping one distribution to another in a particular direction.

2. Whenever q(x) and p(x) match, the term is 0, meaning they are a minimum distance from one another. Likewise, KL(p(x), q(x)) is 0 only if p and q are identical.

3. If q(x) is 0 where p(x) is not, then KL is undefined; by definition, it only computes relative information over the range of x where both distributions have support.

4. KL is always greater than or equal to 0, reaching 0 only when p and q are identical.
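A tiny numerical example (illustrative only, not from the book) makes properties 1 and 2 concrete for two discrete distributions:

import numpy as np

# Two discrete distributions over the same four outcomes
p = np.array([0.1, 0.4, 0.4, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# KL(p || q) = sum_x p(x) * log(p(x) / q(x))
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(kl_pq, kl_qp)               # the two values differ, showing the asymmetry
print(np.sum(p * np.log(p / p)))  # 0.0 when the two distributions are identical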

If we were to use the KL divergence to compute how well an approximation q(z|x) matches our intractable p(z|x), we could write:

KL(q(z|x) || p(z|x)) = ∫ q(z|x) log(q(z|x)/p(z|x)) dz = E_q[log(q(z|x)) - log(p(x|z)) - log(p(z))] + log(p(x))

Now we can write an expression for our intractable p(x) as well: since log(p(x)) does not depend on q(z|x), its expectation with respect to q(z|x) is simply log(p(x)). Thus, we can represent the objective of the VAE, learning the marginal distribution of p(x), using the KL divergence:

log(p(x)) = KL(q(z|x) || p(z|x)) + E_q[log(p(x|z)) + log(p(z)) - log(q(z|x))]

The second term is known as the Variational Lower Bound, also referred to as the Evidence Lower Bound (ELBO); since KL(q, p) is always greater than or equal to 0, log(p(x)) is greater than or (if KL(q, p) is 0) equal to this value.

To explain what this objective is doing, notice that the expectation introduces a difference between q(z|x) (encoding x) and p(x|z)p(z) (the joint probability of the data and the encoding); thus, we want to minimize a lower bound that is essentially the gap between the probability of the encoding and the joint probability of the encoding and data, with an error term given by KL(q, p), the difference between a tractable approximation and the intractable form of the encoder p(z|x). We can imagine the functions Q(z|x) and P(x|z) being represented by two deep neural networks; one generates the latent code z (Q), and the other reconstructs x from this code (P). We can imagine this as an autoencoder setup, as above with the stacked RBM models, with an encoder and decoder.
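A minimal sketch of such an encoder/decoder pair in Keras (an illustration under the flattened-image setup used in this chapter, not the book's exact architecture; the layer sizes and latent_dim value are assumptions):

import tensorflow as tf

latent_dim = 64  # assumed size of the latent code z

# Q(z|x): maps a flattened image to the mean and log-variance of the latent code
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(32 * 32 * 3,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(2 * latent_dim),  # first half: mean, second half: log-variance
])

# P(x|z): maps a sampled latent code back to a reconstructed image vector in [0, 1]
decoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(32 * 32 * 3, activation="sigmoid"),
])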

We want to optimize the parameters of the encoder Q and the decoder P to minimize the reconstruction cost. One way to do this is to construct Monte Carlo samples to optimize the parameters  of Q using gradient descent:

However, it has been found in practice that a large number of samples may be required in order for the variance of these gradient updates to stabilize.

We also have a practical problem here: even if we could choose enough samples to get a good approximation of the gradients for the encoder, our network contains a stochastic, non-differentiable step (sampling z) that we can't backpropagate through, in a similar way to how we couldn't backpropagate through the stochastic units in the RBM in Chapter 4, Teaching Networks to Generate Digits. Thus, our reconstruction error depends on samples from z, but we can't backpropagate through the step that generates these samples to tune the network end to end. Is there a way we can create a differentiable decoder/encoder architecture while also reducing the variance of the sample estimates? One of the main insights of the VAE is to enable this through the "reparameterization trick."
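Putting the pieces together, here is a minimal sketch of a Monte Carlo estimate of the negative variational lower bound for one batch, assuming a Gaussian q(z|x), a standard normal prior p(z), and a Bernoulli decoder; the encoder, decoder, and reparameterize arguments refer to the hypothetical helpers sketched earlier in these notes:

import tensorflow as tf

def neg_elbo(x, encoder, decoder, reparameterize):
    mu, log_var = tf.split(encoder(x), num_or_size_splits=2, axis=-1)
    z = reparameterize(mu, log_var)   # differentiable sample of z
    x_recon = decoder(z)
    # Reconstruction term: negative log-likelihood of x under a Bernoulli decoder
    recon = -tf.reduce_sum(
        x * tf.math.log(x_recon + 1e-7) + (1.0 - x) * tf.math.log(1.0 - x_recon + 1e-7),
        axis=-1)
    # KL(q(z|x) || p(z)) has a closed form when both are Gaussian
    kl = -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
    return tf.reduce_mean(recon + kl)  # minimizing this maximizes the ELBO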



Saturday, March 12, 2022

Creating separable encodings of images

 In Figure 5.1, you can see an example of images from the CIFAR-10 dataset, along with an example of an early VAE algorithm that can generate fuzzy versions of these images based on a random number input:

More recent work on VAE networks has allowed these models to generate much better images, as you will see later in this chapter. To start, let's revisit the problem of generating MNIST digits and how we can extend this approach to more complex data.

Recall from Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, and Chapter 4, Teaching Networks to Generate Digits, that the RBM (or DBN) model in essence involves learning the posterior probability distribution for images (x) given some latent "code" (z), represented by the hidden layer(s) of the network, and the "marginal likelihood" of x:

p(x) = ∫ p(x|z)p(z) dz (a sum in the case of discrete latent units)

We can see z as being an "encoding" of the image x (for example, the activations of the binary hidden units in the RBM), which can be decoded (for example, by running the RBM in reverse to sample an image) to get a reconstruction of x. If the encoding is "good," the reconstruction will be close to the original image. Because these networks encode and decode representations of their input data, they are also known as "autoencoders."

The ability of deep neural networks to capture the underlying structure of complex data is one of their most attractive features; as we saw with the DBN model in Chapter 4, Teaching Networks to Generate Digits, it allows us to improve the performance of a classifier by creating a better underlying model for the distribution of the data. It can also be used to simply create a better way to "compress" the complexity of data, in a similar way to principal component analysis (PCA) in classical statistics. In Figure 5.2, you can see how the stacked RBM model can be used as a way to encode the distribution of faces, for example.

We start with a "pre-training" phase to create a 30-unit encoding vector, which we then calibrate by forcing it to reconstruct the input image, before fine-tuning with standard backpropagation:

As an example of how the stacked RBM model can more effectively represent the distribution of images, the authors of the paper Reducing the Dimensionality of Data with Neural Networks, from which Figure 5.2 is derived, demonstrated using a two-unit code for the MNIST digits compared to PCA:

On the left, we see the digits 0-9 (represented by different shades and shapes) encoded using 2-dimensional PCA. Recall that PCA is generated using a low-dimensional factorization of the covariance matrix of the data:

Cov(X) ≈ UV

where Cov(X) has the same height/width M as the data (for example, 28 by 28 pixels in MNIST) and U and V are both lower dimensional (M * k and k * M), where k is much smaller than M. Because they have a smaller number of rows/columns k than the original data in one dimension, U and V are lower-dimensional representations of the data, and we can get an encoding of an individual image by projecting it onto these k vectors, giving a k-unit encoding of the data. Since the decomposition (and projection) is a linear transformation (multiplying two matrices), the ability of the PCA components to distinguish data well depends on the data being linearly separable (that is, we can draw a hyperplane through the space between groups; that space could be two-dimensional or N-dimensional, like the 784 pixels in the MNIST images).
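As a brief illustration of this projection (a sketch using scikit-learn for convenience, not code from the book; the random data stands in for flattened images):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
images = rng.random((1000, 784)).astype(np.float32)  # stand-in for flattened 28 x 28 images

pca = PCA(n_components=2)
codes = pca.fit_transform(images)               # shape (1000, 2): a 2-unit linear encoding
reconstructions = pca.inverse_transform(codes)  # linear reconstruction from the 2-unit code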

As you can see in Figure 5.3, PCA yields overlapping codes for the images, showing that it is challenging to represent digits using a two-component linear decomposition in which vectors representing the same digit are close together, while those representing different digits are clearly separated. Conceptually, the neural network is able to capture more of the variation between images representing different digits than PCA, as shown by its ability to separate the representations of these digits more clearly in a two-dimensional space.

As an analogy to understand this phenomenon, consider a very simple two-dimensional dataset consisting of parallel hyperbolas (second-degree polynomials):

At the top, even though we have two distinct classes, we cannot draw a straight line through the two-dimensional space to separate the two groups; in a neural network, the weight matrix in a single layer, before the nonlinear transformation of a sigmoid or tanh, is in essence a linear boundary of this kind. However, if we apply a nonlinear transformation to our 2D coordinates, such as taking the square root of the hyperbolas, we can create two separable planes (Figure 5.4, bottom).
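The following sketch illustrates the idea numerically (the specific curves and the particular nonlinear feature are illustrative choices, not the exact transformation shown in Figure 5.4):

import numpy as np

x = np.linspace(-3, 3, 200)
class_a = np.stack([x, x ** 2 + 1.0], axis=1)  # points on y = x^2 + 1
class_b = np.stack([x, x ** 2 + 3.0], axis=1)  # points on y = x^2 + 3; no straight line separates the two sets

# A nonlinear change of coordinates (here y -> y - x^2) flattens both curves,
# so the horizontal line y' = 2 now separates the classes perfectly.
def transform(points):
    return np.stack([points[:, 0], points[:, 1] - points[:, 0] ** 2], axis=1)

a_t, b_t = transform(class_a), transform(class_b)
print(a_t[:, 1].max(), b_t[:, 1].min())  # 1.0 and 3.0, on opposite sides of y' = 2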

A similar phenomenon is at play with our MNIST data: we need a neural network in order to place these 784-pixel images into distinct, separable regions of space.

This goal is achieved by performing a non-linear transformation on the original, overlapping data, with an objective function that rewards increasing the spatial separation among vectors encoding the images of different digits. A separable representation thus increases the ability of the neural network to differentiate image classes using these representations. Thus, in Figure 5.3, we can see on the right that applying the DBN model creates the required non-linear transformation to separate the different images.

Now that we've covered how neural networks can compress data into numerical vectors and what some desirable properties of those vector representations are, we'll examine how to optimally compress information in these vectors. To do so, each element of the vector should encode distinct information from the others, a property we can achieve using a variational objective. This variational objective is the building block for creating VAE networks.


5. Painting Pictures with Neural Networks Using VAEs

 As you saw in Chapter 4, Teaching Networks to Generate Digits, deep neural networks are a powerful tool for creating generative models for complex data such as images, allowing us to develop a network that can generate images from the MNIST hand-drawn digit database. In that example, the data is relatively simple; images can only come from a limited set of categories (the digits 0 through 9) and are low-resolution grayscale data.

What about more complex data, such as color images drawn from the real world? One example of such "real world" data is the Canadian Institute for Advanced Research 10-class dataset, denoted as CIFAR-10. It is a subset of 60,000 examples from a larger set of 80 million images, divided into ten classes - airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. While still an extremely limited set in terms of the diversity of images we would encounter in the real world, these classes have some characteristics that make them more complex than MNIST. For example, the MNIST digits can vary in width, curvature, and a few other properties; the CIFAR-10 classes have a much wider potential range of variation for animal or vehicle photos, meaning we may require more complex models in order to capture this variation.

In this chapter, we will discuss a class of generative models known as Variational Autoencoders (VAEs), which are designed to make the generation of these complex, real-world images more tractable and tunable. They do this by using a number of clever simplifications to make it possible to sample over the complex probability distribution represented by real-world images in a scalable way.

We will explore the following topics to reveal how VAEs work:

- How neural networks create low-dimensional representations of data, and some desirable properties of those representations

- How variational methods allow us to sample from complex data using these representations

- How using the reparameterization trick allows us to stabilize the variance of a neural network based on variational sampling - a VAE

- How we can use Inverse Autoregressive Flow(IAF) to tune the output of a VAE

- How to implement VAE/IAF in TensorFlow


Summary

 In this chapter, you learned about one of the most important models from the beginnings of the deep learning revolution, the DBN. You saw that DBNs are constructed by stacking together RBMs, and how these undirected models can be trained using CD.

This chapter then described a greedy, layer-wise procedure for priming a DBN by sequentially training each of a stack of RBMs, which can then be fine-tuned using the wake-sleep algorithm or backpropagation. We then explored practical examples of using the TensorFlow 2 API to create an RBM layer and a DBN model, illustrating the use of the GradientTape class to compute updates using CD.

You also learned how, following the wake-sleep algorithm, we can compile the DBN as a normal deep neural network and perform backpropagation for supervised training. We applied these models to MNIST data and saw how an RBM can generate digits after training converges, and has features resembling the convolutional filters described in Chapter 3, Building Blocks of Deep Neural Networks.

While the examples in this chapter involved significantly extending the basic layer and model classes of the TensorFlow Keras API, they should give you an idea of how to implement your own low-level alternative training procedures. Going forward, we will mostly stick to using the standard fit() and predict() methods, starting with our next topic, Variational Autoencoders, a sophisticated and computationally efficient way to generate image data.