
Friday, March 25, 2022

Maximum likelihood game

The minimax game can be transformed into a maximum likelihood game, where the aim is to maximize the likelihood of the generator probability density. This is done to ensure that the generator probability density is similar to the real/training data probability density. In other words, the game can be transformed into minimizing the divergence between Pg and Pdata. To do so, we make use of the Kullback-Leibler divergence (KL divergence) to measure how far apart the two distributions of interest are. The overall value function can be denoted as:

The cost function for the generator transforms to:

One important point to note is that KL divergence is not a symmetric measure, that is, KL(Pdata||Pg) != KL(Pg||Pdata). The model typically uses KL(Pg||Pdata) to achieve better results.
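The asymmetry is easy to check numerically. The following is a minimal sketch in plain Python; the two three-outcome distributions are made-up values chosen only for illustration, and `kl_divergence` is a helper name introduced here, not something from the chapter.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three outcomes (illustration only).
p_data = [0.7, 0.2, 0.1]   # stands in for the real data distribution
p_g = [0.4, 0.4, 0.2]      # stands in for the generator distribution

forward_kl = kl_divergence(p_data, p_g)   # KL(Pdata || Pg)
reverse_kl = kl_divergence(p_g, p_data)   # KL(Pg || Pdata)

# The two directions give different numbers, so the order of arguments matters.
print("KL(Pdata||Pg) =", forward_kl)
print("KL(Pg||Pdata) =", reverse_kl)
```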

The three different cost functions discussed so far have slightly different trajectories and thus lead to different properties at different stages of training. These three functions can be visualized as shown in Figure 6.7:


Non-saturating generator cost

In practice, we do not train the generator to minimize log(1-D(G(z))), as this function does not provide sufficient gradients for learning. During the initial learning phases, where G is poor, the discriminator is able to tell the fake from the real with high confidence. This leads to the saturation of log(1-D(G(z))), which hinders improvements in the generator model. We thus tweak the generator to maximize log(D(G(z))) instead:

This provides stronger gradients for the generator to learn. This is shown in Figure 6.6. The x-axis denotes D(G(z)). The top line shows the objective, which is minimizing the likelihood of the discriminator being correct. The bottom line (updated objective) works by maximizing the likelihood of the discriminator being wrong.

Figure 6.6 illustrates how a slight change helps achieve better gradients during the initial phases of training.
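The difference in gradient strength can be sketched numerically. The snippet below assumes the discriminator output is a sigmoid of a logit a, so D(G(z)) = sigmoid(a); this parameterization and the function names are assumptions made for the illustration. It compares the derivative of each generator loss with respect to that logit when the discriminator confidently rejects a fake sample.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)): the loss the generator minimizes in the minimax game."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)): the non-saturating loss (maximize log D(G(z)))."""
    return -(1.0 - sigmoid(a))

# Early in training the discriminator rejects fakes confidently: D(G(z)) is near 0.
logit = -5.0
print("D(G(z))             =", sigmoid(logit))
print("saturating grad     =", saturating_grad(logit))      # close to 0: little signal
print("non-saturating grad =", non_saturating_grad(logit))  # close to -1: strong signal
```

Both losses push the logit in the same direction; only the size of the learning signal differs when D(G(z)) is small.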

Training GANs

Training a GAN is like playing a game of two adversaries. The generator is learning to generate good enough fake samples, while the discriminator is working hard to discriminate between real and fake. More formally, this is termed the minimax game, where the value function V(G,D) is described as follows:
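The equation itself did not survive in these notes. The standard form from Goodfellow et al. (2014), which matches the description above, is reproduced here for reference; p_z denotes the noise distribution that z is drawn from (written Pmodel(z) elsewhere in this text).

```latex
\min_G \max_D V(G, D) =
  \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```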

This is also called the zero-sum game, which has an equilibrium that is the same as the Nash equilibrium. We can better understand the value function V(G,D) by separating out the objective function for each of the players. The following equations describe individual objective functions:
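The equations are missing from these notes. A reconstruction consistent with the description that follows (the generator objective is the negative of the discriminator objective), in the form used in Goodfellow's 2016 NIPS tutorial, would be:

```latex
J^{(D)} = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right]
          -\frac{1}{2}\,\mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right]

J^{(G)} = -J^{(D)}
```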

where Jd is the discriminator objective function in the classical sense, Jg is the generator objective equal to the negative of the discriminator, and Pdata is the distribution of the training data. The rest of the terms have their usual meaning. This is one of the simplest ways of defining the game or corresponding objective functions. Over the years, different ways have been studied, some of which we will cover in this chapter.

The objective functions help us to understand the aim of each of the players. If we assume both probability densities are non-zero everywhere, we can get the optimal value of D(x) as:
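The formula is not included in these notes. The well-known result from Goodfellow et al. (2014), stated with p_g for the generator density, is:

```latex
D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}
```

When p_g equals p_data, this gives D*(x) = 1/2 everywhere.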

We will revisit this equation in the latter part of the chapter. For now, the next step is to present a training algorithm wherein the discriminator and generator models train towards their respective objectives. The simplest yet widely used way of training a GAN (and by far the most successful one) is as follows.

Repeat the following steps N times. N is the number of total iterations:

1. Repeat the following steps k times:

* Sample a minibatch of m noise inputs for the generator: {z1, z2, ..., zm} ~ Pmodel(z)

* Sample a minibatch of size m from the actual data: {x1, x2, ..., xm} ~ Pdata(x)

* Update the discriminator using its loss, Jd

2. Set the discriminator as non-trainable

3. Sample a minibatch of m noise inputs for the generator: {z1, z2, ..., zm} ~ Pmodel(z)

4. Update the generator using its loss, Jg

In their original paper, Goodfellow et al. used k=1, that is, they trained discriminator and generator models alternately. There are some variants and hacks where it is observed that training the discriminator more often than the generator helps with better convergence.
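The steps above can be sketched on a toy problem in plain Python. Everything in this sketch is an illustrative assumption rather than the chapter's implementation: the generator is a linear map G(z) = a*z + b, the discriminator is D(x) = sigmoid(w*x + c), the "real" data is drawn from a normal distribution with mean 4, the gradients are written out by hand, and the generator uses the non-saturating loss.

```python
import math
import random

def sigmoid(s):
    # Numerically stable sigmoid.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(n_iters=500, k=1, m=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.0, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(n_iters):
        # Step 1: k discriminator updates.
        for _ in range(k):
            z = [rng.gauss(0.0, 1.0) for _ in range(m)]   # noise minibatch
            x = [rng.gauss(4.0, 1.0) for _ in range(m)]   # "real" minibatch
            fake = [a * zi + b for zi in z]
            d_real = [sigmoid(w * xi + c) for xi in x]
            d_fake = [sigmoid(w * fi + c) for fi in fake]
            # Gradients of Jd = -mean(log D(x)) - mean(log(1 - D(G(z)))).
            grad_w = (sum((dr - 1.0) * xi for dr, xi in zip(d_real, x))
                      + sum(df * fi for df, fi in zip(d_fake, fake))) / m
            grad_c = (sum(dr - 1.0 for dr in d_real) + sum(d_fake)) / m
            w -= lr * grad_w
            c -= lr * grad_c
        # Steps 2-4: discriminator frozen (w, c untouched), one generator update.
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        d_fake = [sigmoid(w * (a * zi + b) + c) for zi in z]
        # Gradients of the non-saturating loss Jg = -mean(log D(G(z))).
        grad_a = sum((df - 1.0) * w * zi for df, zi in zip(d_fake, z)) / m
        grad_b = sum((df - 1.0) * w for df in d_fake) / m
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, w, c

a, b, w, c = train_toy_gan()
print("generator: a=%.3f b=%.3f | discriminator: w=%.3f c=%.3f" % (a, b, w, c))
```

Raising k in this sketch trains the discriminator more often per generator update, which is the variant mentioned above.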


The following figure (Figure 6.5) showcases the training phases of the generator and discriminator models. The smaller dotted line is the discriminator model, the solid line is the generator model, and the larger dotted line is the actual training data. The vertical lines at the bottom demonstrate the sampling of data points from the distribution of z, that is, x=pmodel(z). The lines point to the fact that the generator contracts in the regions of high density and expands in the regions of low density. Part (a) shows the initial stages of the training phase, where the discriminator (D) is a partially correct classifier. Parts (b) and (c) show how improvements in D guide changes in the generator, G. Finally, in part (d) you can see where pmodel=pdata and the discriminator is no longer able to differentiate between fake and real samples, that is, D(x)=1/2.



Thursday, March 24, 2022

The generator model

 This is the primary model of interest in the whole game. This model generates samples that are intended to resemble the samples from our training set. The model takes random unstructured noise as input (typically denoted as z) and tries to create a varied set of output. The generator model is usually a differentiable function; it is often represented by a deep neural network but is not restricted to that.

We denote the generator as G and its output as G(z). We typically use a lower-dimensional z as compared to the dimension of the original data, x, that is, Zdim <= Xdim. This is done as a way of compressing or encoding real-world information into lower-dimensional space.

In simple words, the generator trains to generate samples good enough to fool the discriminator, while the discriminator trains to properly classify real (training samples) versus fake (output from the generator). Thus, this game of adversaries uses a generator model, G, which tries to make D(G(z)) as close to 1 as possible. The discriminator is incentivized to make D(G(z)) close to 0, where 1 denotes real and 0 denotes fake samples. The GAN model achieves equilibrium when the generator starts to easily fool the discriminator, that is, the discriminator reaches its saddle point. While, in theory, GANs have several advantages over other methods in the family tree described previously, they pose their own set of problems. We will discuss some of them in the upcoming sections.


Tuesday, March 22, 2022

The discriminator model

This model represents a differentiable function that tries to maximize a probability of 1 for samples drawn from the training distribution. This can be any classification model, but we usually prefer a deep neural network for this. This is the throw-away model (similar to the decoder part of autoencoders).

The discriminator is also used to classify whether the output from the generator is real or fake. The main utility of this model is to help develop a robust generator. We denote the discriminator model as D and its output as D(x). When it is used to classify output from the generator model, the discriminator output is denoted as D(G(z)), where G(z) is the output from the generator model.

Generative adversarial networks

GANs have a pretty interesting origin story. It all began as a discussion/argument in a bar, with Ian Goodfellow and friends discussing work related to generating data using neural networks. The argument ended with everyone downplaying each other's methods. Goodfellow went back home and coded the first version of what we now call a GAN. To his amazement, the code worked on the first try. A more verbose description of the chain of events was shared by Goodfellow himself in an interview with Wired magazine.

As mentioned, GANs are implicit density functions that sample directly from the underlying distribution. They do this by defining a two-player game of adversaries. The adversaries compete against each other under well-defined reward functions and each player tries to maximize its rewards. Without going into the details of game theory, the framework can be explained as follows.


The taxonomy of generative models

 Generative models are a class of models in the unsupervised machine learning space. They help us to model the underlying distributions responsible for generating the dataset under consideration. There are different methods/frameworks to work with generative models. The first set of methods correspond to models that represent data with an explicit density function. Here we define a probability density function, P, explicitly and develop a model that increases the maximum likelihood of sampling from this distribution.

There are two further types within explicit density methods: tractable and approximate density methods. PixelRNNs are an active area of research for tractable density methods. When we try to model complex real-world data distributions, for example, natural images or speech signals, defining a parametric function becomes challenging. To overcome this, you learned about RBMs and VAEs in Chapter 4, Teaching Networks to Generate Digits, and Chapter 5, Painting Pictures with Neural Networks Using VAEs, respectively. These techniques work by approximating the underlying probability density functions explicitly. VAEs work towards maximizing the likelihood estimates of the lower bound, while RBMs use Markov chains to make an estimate of the distribution. The overall landscape of generative models can be described using Figure 6.2:

GANs fall under implicit density modeling methods. The implicit density functions give up the property of explicitly defining the underlying distribution but work by defining methods to draw samples from such distributions. The GAN framework is a class of methods that can sample directly from the underlying distributions. This alleviates some of the complexities associated with the methods we have covered so far, such as defining underlying probability distribution functions and the quality of output. Now that you have a high-level understanding of generative models, let's dive deeper into the details of GANs.


Monday, March 21, 2022

6. Image Generation with GANs

Generative modeling is a powerful concept that provides us with immense potential to approximate or model underlying processes that generate data. In the previous chapters, we covered concepts associated with deep learning in general and more specifically related to restricted Boltzmann machines (RBMs) and variational autoencoders (VAEs). This chapter will introduce another family of generative models called Generative Adversarial Networks (GANs).

Heavily inspired by the concepts of game theory and picking up some of the best components from previously discussed techniques, GANs provide a powerful framework for working in the generative modeling space. Since their invention in 2014 by Goodfellow et al., GANs have benefitted from tremendous research and are now being used to explore creative domains such as art, fashion, and photography.

The following are two amazing high-quality samples from a variant of GANs called StyleGAN (Figure 6.1). The photograph of the kid is actually of a fictional person who does not exist. The art sample is also generated by a similar network. StyleGANs are able to generate high-quality sharp images by using the concept of progressive growth (we will cover this in detail in later sections). These outputs were generated using the StyleGAN2 model trained on datasets such as the Flickr-Faces-HQ (FFHQ) dataset.

This chapter will cover:

- The taxonomy of generative models

- A number of improved GANs, such as DCGAN, Conditional-GAN, and so on

- The progressive GAN setup and its various components

- Some of the challenges associated with GANs

- Hands-on examples



Saturday, March 19, 2022

Summary

In this chapter, you saw how deep neural networks can be used to create representations of complex data such as images that capture more of their variance than traditional dimension reduction techniques, such as PCA. This is demonstrated using the MNIST digits, where a neural network can spatially separate the different digits in a two-dimensional grid more cleanly than the principal components of those images. The chapter showed how deep neural networks can be used to approximate complex posterior distributions, such as images, using variational methods to sample from an approximation for an intractable distribution, leading to a VAE algorithm based on minimizing the variational lower bound between the true and approximate posterior.

You also learned how the latent vector from this algorithm can be reparameterized to have lower variance, leading to better convergence in stochastic minibatch gradient descent. You saw how the latent vectors generated by encoders in these models, which are usually independent, can be transformed into more realistic correlated distributions using IAF. Finally, we implemented these models on the CIFAR-10 dataset and showed how they can be used to reconstruct the images and generate new images from random vectors.

The next chapter will introduce GANs and show how we can use them to add stylistic filters to input images, using the StyleGAN model.


Creating the network from TensorFlow 2

Now that we've downloaded the CIFAR-10 dataset, split it into test and training data, and reshaped and rescaled it, we are ready to start building our VAE model. We'll use the same Model API from the Keras module in TensorFlow 2. The TensorFlow documentation contains an example of how to implement a VAE using convolutional networks (https://www.tensorflow.org/tutorials/generative/cvae), and we'll build on this code example; however, for our purposes, we will implement a simpler VAE network using MLP layers based on the original VAE paper, Auto-Encoding Variational Bayes, and show how we adapt the TensorFlow example to also allow for IAF modules in decoding.

In the original article, the authors propose two kinds of models for use in the VAE, both MLP feedforward networks: Gaussian and Bernoulli, with these names reflecting the probability distribution functions used in the MLP network outputs in their final layers. The Bernoulli MLP can be used as the decoder of the network, generating the simulated image x from the latent vector z. The formula for the Bernoulli MLP is:
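The formula is missing from these notes. As given in Appendix C.1 of Auto-Encoding Variational Bayes (the weight and bias symbols follow the paper and do not appear elsewhere in this text), it reads:

```latex
\log p(x \mid z) = \sum_{i=1}^{D} x_i \log y_i + (1 - x_i)\log(1 - y_i)

y = f_{\sigma}\left(W_2 \tanh(W_1 z + b_1) + b_2\right)
```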

Where the first line is the cross-entropy function we use to determine if the network generates an approximation of the original image in reconstruction, while y is a feedforward network with two layers: a tanh transformation followed by a sigmoidal function to scale the output between 0 and 1. Recall that this scaling is why we had to normalize the CIFAR-10 pixels from their original values.

We can easily create this Bernoulli MLP network using the Keras API:

class BernoulliMLP(tf.keras.Model):
    def __init__(self, input_shape, name='BernoulliMLP', hidden_dim=10, latent_dim=10, **kwargs):
        super().__init__(name=name, **kwargs)
        # Hidden tanh layer followed by a sigmoid output layer
        self._h = tf.keras.layers.Dense(hidden_dim, activation='tanh')
        self._y = tf.keras.layers.Dense(latent_dim, activation='sigmoid')

    def call(self, x):
        # Three return values so the signature matches GaussianMLP
        return self._y(self._h(x)), None, None

We just need to specify the dimensions of the single hidden layer and the latent output (z). We then specify the forward pass as a composition of these two layers. Note that in the output, we've returned three values, with the second two set as None. This is because in our end model, we could use either the BernoulliMLP or GaussianMLP as the decoder. If we used the GaussianMLP, we return three values, as we will see below; the example in this chapter utilizes a binary output and cross-entropy loss so we can use just the single output, but we want the return signatures for the two decoders to match.

The second network type proposed by the authors in the original VAE paper was a Gaussian MLP, whose formulas are:
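The formulas are missing from these notes. As given in Appendix C.2 of Auto-Encoding Variational Bayes for the decoder case (symbols follow the paper), they read:

```latex
\log p(x \mid z) = \log \mathcal{N}\left(x;\ \mu,\ \sigma^{2} I\right)

\mu = W_4 h + b_4, \qquad
\log \sigma^{2} = W_5 h + b_5, \qquad
h = \tanh(W_3 z + b_3)
```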

This network can be used as either the encoder (generating the latent vector z) or the decoder (generating the simulated image x) in the network. The equations above assume that it is used as the decoder, and for the encoder we just switch the x and z variables. As you can see, this network has two types of layers: a hidden layer given by a tanh transformation of the input, and two output layers, each given by linear transformations of the hidden layer, which are used as the inputs of a lognormal likelihood function. Like the Bernoulli MLP, we can easily implement this simple network using the TensorFlow Keras API:

class GaussianMLP(tf.keras.Model):
    def __init__(self, input_shape, name='GaussianMLP', hidden_dim=10, latent_dim=10, iaf=False, **kwargs):
        super().__init__(name=name, **kwargs)
        self._h = tf.keras.layers.Dense(hidden_dim, activation='tanh')
        # Two linear output layers: mean and log variance
        # (the _mean layer was missing from the listing but is used in call)
        self._mean = tf.keras.layers.Dense(latent_dim)
        self._logvar = tf.keras.layers.Dense(latent_dim)
        # Optional third output h, only needed when IAF is used
        self._iaf_output = None
        if iaf:
            self._iaf_output = tf.keras.layers.Dense(latent_dim)

    def call(self, x):
        if self._iaf_output is not None:
            return self._mean(self._h(x)), self._logvar(self._h(x)), self._iaf_output(self._h(x))
        else:
            return self._mean(self._h(x)), self._logvar(self._h(x)), None

As you can see, to implement the call function, we must return the two outputs of the model (the mean and log variance of the normal distribution we'll use to compute the likelihood of z or x). However, recall that for the IAF model, the encoder has to have an additional output h, which is fed into each step of the normalizing flow:

To allow for this additional output, we include a third variable in the output, which gets set to a linear transformation of the input if we set the IAF option to True, and is None if False, so we can use the GaussianMLP as an encoder in networks both with and without IAF.

Now that we have both of our subnetworks defined, let's see how we can use them to construct a complete VAE network. Like the sub-networks, we can define the VAE using the Keras API:

class VAE(tf.keras.Model):
    def __init__(self, input_shape, name='variational_autoencoder', latent_dim=10, hidden_dim=10, encoder='GaussianMLP', decoder='BernoulliMLP', iaf_model=None, number_iaf_networks=0, iaf_params={}, num_samples=100, **kwargs):
        super().__init__(name=name, **kwargs)
        self._latent_dim = latent_dim
        self._num_samples = num_samples
        self._iaf = []
        if encoder == 'GaussianMLP':
            self._encoder = GaussianMLP(input_shape=input_shape, latent_dim=latent_dim, iaf=(iaf_model is not None), hidden_dim=hidden_dim)
        else:
            raise ValueError("Unknown encoder type: {}".format(encoder))
        if decoder == 'BernoulliMLP':
            self._decoder = BernoulliMLP(input_shape=(1, latent_dim), latent_dim=input_shape[1], hidden_dim=hidden_dim)
        elif decoder == 'GaussianMLP':
            # Assign to _decoder here; the listing assigned to _encoder by mistake
            self._decoder = GaussianMLP(input_shape=(1, latent_dim), latent_dim=input_shape[1], iaf=(iaf_model is not None), hidden_dim=hidden_dim)
        else:
            raise ValueError("Unknown decoder type: {}".format(decoder))
        if iaf_model:
            # Stack of autoregressive transforms; each takes z and h concatenated
            self._iaf = []
            for t in range(number_iaf_networks):
                self._iaf.append(iaf_model(input_shape=(1, latent_dim*2), **iaf_params))

As you can see, this model is defined to contain both an encoder and decoder network. Additionally, we allow the user to specify whether we are implementing IAF as part of the model, in which case we need a stack of autoregressive transforms specified by the iaf_params variable. Because this IAF network needs to take both z and h as inputs, the input shape is twice the size of the latent_dim (z). We allow the decoder to be either the GaussianMLP or BernoulliMLP network, while the encoder is the GaussianMLP.

There are a few other functions of this model class that we need to cover; the first are simply the encoding and decoding functions of the VAE model class:

    def encode(self, x):
        return self._encoder.call(x)

    def decode(self, z, apply_sigmoid=False):
        logits, _, _ = self._decoder.call(z)
        if apply_sigmoid:
            probs = tf.sigmoid(logits)
            return probs
        return logits

For the encoder, we simply call (run the forward pass for) the encoder network. To decode, you will notice that we specify three outputs. The article that introduced VAE models, Auto-Encoding Variational Bayes, provided examples of a decoder specified as either a Gaussian Multilayer Perceptron (MLP) or Bernoulli output. If we used a Gaussian MLP, the decoder would yield the value, mean, and standard deviation vectors for the output, and we need to transform that output to a probability (0 to 1) using the sigmoidal transform. In the Bernoulli case, the output is already in the range 0 to 1, and we don't need this transformation (apply_sigmoid=False).

Once we've trained the VAE network, we'll want to use sampling in order to generate random latent vectors (z) and run the decoder to generate new images. While we could just run this as a normal function of the class in the Python runtime, we'll decorate this function with the @tf.function annotation, which will allow it to be executed in the TensorFlow graph runtime (just like any of the tf functions, such as reduce_sum and multiply), making use of GPU and TPU devices if they are available. We sample a value from a random normal distribution, for a specified number of samples, and then apply the decoder to generate new images:

    @tf.function
    def sample(self, eps=None):
        if eps is None:
            eps = tf.random.normal(shape=(self._num_samples, self._latent_dim))
        # Go through decode(), which accepts apply_sigmoid; the decoder's call() does not
        return self.decode(eps, apply_sigmoid=False)

Finally, recall that the "reparameterization trick" is used to allow us to backpropagate through the value of z and reduce the variance of the likelihood of z. We need to implement this transformation, which is given by:

    def reparameterize(self, mean, logvar):
        eps = tf.random.normal(shape=mean.shape)
        # exp(0.5 * logvar) is the standard deviation
        return eps * tf.exp(logvar * .5) + mean

In the original paper, Auto-Encoding Variational Bayes, this is given by:

where i is a data point in x and l is a sample from the random distribution, here, a normal. In our code, we multiply by 0.5 because we are computing the log variance (or standard deviation squared), and log(s^2) = 2 log(s), so the 0.5 cancels the 2, leaving us with exp(log(s)) = s, just as we require in the formula.
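The 0.5 factor can be checked with a few lines of plain Python. This is a small sketch that mirrors the reparameterize method without TensorFlow; the list-based function and the example values (a standard deviation of 3 and a mean of 2) are assumptions made for the illustration.

```python
import math
import random

def reparameterize(mean, logvar, rng):
    """z = mean + exp(0.5 * logvar) * eps, with eps drawn from a standard normal."""
    eps = [rng.gauss(0.0, 1.0) for _ in mean]
    return [e * math.exp(0.5 * lv) + m for e, lv, m in zip(eps, logvar, mean)]

sigma = 3.0
logvar = math.log(sigma ** 2)            # log variance, as the encoder would output
std_from_logvar = math.exp(0.5 * logvar)
print(std_from_logvar)                   # recovers the standard deviation

# Samples produced this way are centered on the requested mean.
rng = random.Random(0)
samples = [reparameterize([2.0], [logvar], rng)[0] for _ in range(20000)]
print(sum(samples) / len(samples))
```

Because the randomness lives entirely in eps, the output is a deterministic, differentiable function of mean and logvar, which is what lets gradients flow back to the encoder.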

We'll also include a class property (with the @property decorator) so we can access the array of normalizing transforms if we implement IAF:

    @property
    def iaf(self):
        return self._iaf

Now, we'll need a few additional functions to actually run our VAE algorithm. The first computes the log normal probability density function (pdf), used in the computation of the variational lower bound, or ELBO:

def log_normal_pdf(sample, mean, logvar, raxis=1):
    log2pi = tf.math.log(2. * np.pi)
    # Sum of per-dimension Gaussian log densities along raxis
    return tf.reduce_sum(
        -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi), axis=raxis)
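As a sanity check on the expression inside the sum, the sketch below re-implements it in plain Python and compares it against the logarithm of the textbook Gaussian density. The list-based helper and the test values are assumptions for illustration; they are not part of the chapter's code.

```python
import math

def log_normal_pdf(sample, mean, logvar):
    """Sum over dimensions of -0.5 * ((x - mu)^2 / var + log(var) + log(2*pi))."""
    log2pi = math.log(2.0 * math.pi)
    return sum(-0.5 * ((s - m) ** 2 * math.exp(-lv) + lv + log2pi)
               for s, m, lv in zip(sample, mean, logvar))

def normal_pdf(x, mu, sigma):
    """Textbook density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Two independent dimensions with different means and standard deviations.
x, mu, sigma = [0.5, -1.0], [0.0, 1.0], [1.0, 2.0]
logvar = [math.log(s ** 2) for s in sigma]

from_formula = log_normal_pdf(x, mu, logvar)
from_density = sum(math.log(normal_pdf(xi, mi, si)) for xi, mi, si in zip(x, mu, sigma))
print(from_formula, from_density)   # the two values agree
```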

We now need to utilize this function as part of computing the loss with each minibatch gradient descent pass in the process of training the VAE. As with the sample method, we'll decorate this function with the @tf.function annotation so it will be executed on the graph runtime:

@tf.function
def compute_loss(model, x):
    # Encode, then draw z with the reparameterization trick
    mean, logvar, h = model.encode(x)
    z = model.reparameterize(mean, logvar)
    logqz_x = log_normal_pdf(z, mean, logvar)
    # Optional IAF steps: transform z and correct log q(z|x)
    for iaf_model in model.iaf:
        mean, logvar, _ = iaf_model.call(tf.concat([z, h], 2))
        s = tf.sigmoid(logvar)
        z = tf.add(tf.math.multiply(z, s), tf.math.multiply(mean, (1 - s)))
        logqz_x -= tf.reduce_sum(tf.math.log(s))
    # Reconstruction term and prior term
    x_logit = model.decode(z)
    cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
    logpx_z = -tf.reduce_sum(cross_ent, axis=[2])
    logpz = log_normal_pdf(z, 0., 0.)
    # Negative ELBO, averaged over the minibatch
    return -tf.reduce_mean(logpx_z + logpz - logqz_x)

Let's unpack a bit of what is going on here. First, we can see that we call the encoder network on the input (a minibatch of flattened images, in our case) to generate the needed mean, log variance, and, if we are using IAF in our network, the accessory input h that we'll pass along with each step of the normalizing flow transform.

We apply the "reparameterization trick" on the inputs in order to generate the latent vector z, and apply a lognormal pdf to get the logq(z|x).

If we are using IAF, we need to iteratively transform z using each network, and pass in the h (accessory input) from the encoder at each step. Then we apply the loss from this transform to the initial loss we computed, as per the algorithm given in the IAF paper.
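A single step of this update can be written out in plain Python to make the arithmetic concrete. The sketch mirrors the loop body of compute_loss (gate s from a sigmoid, z mixed with the proposed mean, log q reduced by the sum of log s); the function name and the input values are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def iaf_step(z, m, s_logits, log_q):
    """One IAF update: z <- s*z + (1 - s)*m and log q <- log q - sum(log s)."""
    s = [sigmoid(a) for a in s_logits]
    z_new = [si * zi + (1.0 - si) * mi for si, zi, mi in zip(s, z, m)]
    log_q_new = log_q - sum(math.log(si) for si in s)
    return z_new, log_q_new

# With gate logits of 0 the gate is 0.5, so z moves halfway towards m.
z_new, log_q_new = iaf_step(z=[2.0, 4.0], m=[0.0, 0.0], s_logits=[0.0, 0.0], log_q=-3.0)
print(z_new, log_q_new)
```

The log s term is the log-determinant of the transform's Jacobian, which is why it is subtracted from log q(z|x) at every step.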

Once we have the transformed or untransformed z, we decode it using the decoder network to get the reconstructed data, x, from which we calculate a cross-entropy loss. We sum these over the minibatch and take the lognormal pdf of z evaluated at a standard normal distribution (the prior), before computing the expected lower bound.

Recall that the expression for the variational lower bound, or ELBO, is:

So, our minibatch estimator is a sample of this value:
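Both expressions are missing from these notes. A reconstruction that agrees term by term with compute_loss (logpx_z + logpz - logqz_x, averaged over the minibatch) is given below; M, the minibatch size, is notation introduced here.

```latex
\mathrm{ELBO}(x) = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z) + \log p(z) - \log q(z \mid x)\right]

\widehat{\mathrm{ELBO}} = \frac{1}{M}\sum_{i=1}^{M}
  \left[\log p(x_i \mid z_i) + \log p(z_i) - \log q(z_i \mid x_i)\right],
  \qquad z_i \sim q(z \mid x_i)
```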

Now that we have these ingredients, we can run the stochastic gradient descent using the GradientTape API, just as we did for the DBN in Chapter 4, Teaching Networks to Generate Digits, passing in an optimizer, model, and minibatch of data (x):

@tf.function
def compute_apply_gradients(model, x, optimizer):
    # Record the forward pass so gradients of the loss can be computed
    with tf.GradientTape() as tape:
        loss = compute_loss(model, x)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

To run the training, first we need to specify a model using the class we've built. If we don't want to use IAF, we could do this as follows:

model = VAE(input_shape=(1, 3072), hidden_dim=500, latent_dim=500)

If we want to use IAF transformations, we need to include some additional arguments:

model = VAE(input_shape=(1, 3072), hidden_dim=500, latent_dim=500, iaf_model=GaussianMLP, number_iaf_networks=3, iaf_params={'latent_dim': 500, 'hidden_dim': 500, 'iaf': False})

With the model created, we need to specify a number of epochs and an optimizer (in this instance, Adam, as we described in Chapter 3, Building Blocks of Deep Neural Networks). We split our data into minibatches of 32 elements, and apply gradient updates after each minibatch for the number of epochs we've specified. At regular intervals, we output the estimate of the ELBO to verify that our model is getting better:

import time

epochs = 100
optimizer = tf.keras.optimizers.Adam(1e-4)

for epoch in range(1, epochs + 1):
    start_time = time.time()
    for train_x in cifar10_train.map(lambda x: flatten_image(x, label=False)).batch(32):
        compute_apply_gradients(model, train_x, optimizer)
    end_time = time.time()

    # Evaluate on the test set every epoch
    if epoch % 1 == 0:
        loss = tf.keras.metrics.Mean()
        for test_x in cifar10_test.map(lambda x: flatten_image(x, label=False)).batch(32):
            loss(compute_loss(model, test_x))
        elbo = -loss.result()
        print('Epoch: {}, Test set ELBO: {}, time elapsed for current epoch {}'.format(epoch, elbo, end_time - start_time))

We can verify that the model is improving by looking at updates, which should show an increasing ELBO:

To examine the output of the model, we can first look at the reconstruction error: does the encoding of the input image by the network approximately capture the dominant patterns in the input image, allowing it to be reconstructed from its vector z? We can compare the raw image to its reconstruction formed by passing the image through the encoder, applying IAF, and then decoding it:

for sample in cifar10_train.map(lambda x: flatten_image(x, label=False)).batch(1).take(10):
    mean, logvar, h = model.encode(sample)
    z = model.reparameterize(mean, logvar)
    for iaf_model in model.iaf:
        mean, logvar, _ = iaf_model.call(tf.concat([z, h], 2))
        s = tf.sigmoid(logvar)
        z = tf.add(tf.math.multiply(z, s), tf.math.multiply(mean, (1 - s)))
    plt.figure(0)
    plt.imshow((sample.numpy().reshape(32, 32, 3)).astype(np.float32), cmap=plt.get_cmap("gray"))

For the first few CIFAR-10 images, we get the following output, showing that we have captured the overall pattern of the image (although it is fuzzy, a general downside to VAEs that we'll address in our discussion of Generative Adversarial Networks(GANs) in future chapters):

What if we wanted to create entirely new images? Here we can use the "sample" function we defined previously in Creating the network from TensorFlow 2 to create batches of new images from randomly generated z vectors, rather than the encoded product of CIFAR images:

# sample() decodes num_samples random latent vectors; show the first generated image
plt.imshow(model.sample().numpy()[0].reshape(32, 32, 3).astype(np.float32), cmap=plt.get_cmap("gray"))

This code will produce output like the following, which shows a set of images generated from vectors of random numbers:

These are, admittedly, a bit blurry, but you can appreciate that they show structure and look comparable to some of the "reconstructed" CIFAR-10 images you saw previously. Part of the challenge here, as we'll discuss more in subsequent chapters, is the loss function itself: the cross-entropy function, in essence, penalizes each pixel for how much it differs from the input pixel. While this might be mathematically correct, it doesn't capture what we might think of as conceptual "similarity" between an input and reconstructed image. For example, an input image could have a single pixel set to infinity, which would create a large difference between it and a reconstruction that set that pixel to 0; however, a human, looking at the image, would perceive both as being identical. The objective functions used for GANs, described in Chapter 6, Image Generation with GANs, capture this nuance more accurately.







 


    




Thursday, March 17, 2022

Importing CIFAR

Now that we've discussed the underlying theory of VAE algorithms, let's start building a practical example using a real-world dataset. As we discussed in the introduction, for the experiments in this chapter, we'll be working with the Canadian Institute for Advanced Research (CIFAR) 10 dataset. The images in this dataset are part of a larger 80 million "small image" dataset, most of which do not have class labels like CIFAR-10. For CIFAR-10, the labels were initially created by student volunteers, and the larger tiny images dataset allows researchers to submit labels for parts of the data.

Like the MNIST dataset, CIFAR-10 can be downloaded using the TensorFlow Datasets API:

import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds

cifar10_builder = tfds.builder("cifar10")
cifar10_builder.download_and_prepare()

This will download the dataset to disk and make it available for our experiments. To split it into training and test sets, we can use the following commands:

cifar10_train = cifar10_builder.as_dataset(split="train")

cifar10_test = cifar10_builder.as_dataset(split="test")

Let's inspect one of the images to see what format it is in:

cifar10_train.take(1)

The output tells us that each image in the dataset is of format <DatasetV1Adapter shapes: {image: (32, 32, 3), label: ()}, types: {image: tf.uint8, label: tf.int64}>. Unlike the MNIST dataset we used in Chapter 4, Teaching Networks to Generate Digits, the CIFAR images have three color channels, each with 32 * 32 pixels, while the label is an integer from 0 to 9 (representing one of the 10 classes). We can also plot the images to inspect them visually:

from PIL import Image

import numpy as np

import matplotlib.pyplot as plt

for sample in cifar10_train.map(lambda x: flatten_image(x, label=True)).take(1):

    plt.imshow(sample[0].numpy().reshape(32,32,3).astype(np.float32), cmap=plt.get_cmap("gray"))

    print("Label: %d" % sample[1].numpy())

This gives the following output:

Like the RBM model, the VAE model we'll build in this example has an output scaled between 0 and 1 and accepts flattened versions of the images, so we'll need to turn each image into a vector and scale it to a maximum of 1:

def flatten_image(x, label=False):

    if label:

        return (tf.divide(tf.dtypes.cast(tf.reshape(x["image"], (1, 32*32*3)), tf.float32), 256.0), x["label"])

    else:

        return (tf.divide(tf.dtypes.cast(tf.reshape(x["image"], (1, 32*32*3)), tf.float32), 256.0))

This results in each image being a vector of length 3072 (32*32*3), which we can reshape once we've run the model to examine the generated images.

Inverse Autoregressive Flow

In our discussion earlier, it was noted that we want to use q(z|x) as a way to approximate the "true" p(z|x) that would allow us to generate an ideal encoding of the data, and thus sample from it to generate new images. So far, we've assumed that q(z|x) has a relatively simple distribution, such as a vector of independent Gaussian random variables (a diagonal covariance matrix with 0s on the off-diagonal elements). This sort of distribution has many benefits: because it is simple, we have an easy way to generate new samples by drawing from random normal distributions, and because it is independent, we can separately tune each element of the latent vector z to influence parts of the output image.

However, such a simple distribution may not fit the desired output distribution of data well, increasing the KL divergence between p(z|x) and q(z|x). Is there a way we can keep the desirable properties of q(z|x) but "transform" z so that it captures more of the complexities needed to represent x?

One approach is to apply a series of autoregressive transformations to z to turn it from a simple to a complex distribution; by "autoregressive," we mean that each transformation utilizes both data from the previous transformation and the current data to compute an updated version of z. In contrast, the basic form of VAE that we introduced above has only a single "transformation": from z to the output (though z might pass through multiple layers, there is no recursive network link to further refine that output). We've seen such transformations before, such as the LSTM networks in Chapter 3, Building Blocks of Deep Neural Networks, where the output of the network is a combination of the current input and a weighted version of the prior time step.

An attractive property of the independent q(z|x) distributions we discussed earlier, such as independent normals, is that they have a very tractable expression for the log likelihood. This property is important for the VAE model because its objective function depends on integrating over the whole likelihood function, which would be cumbersome for more complex log likelihood functions. However, by constraining a transformed z to computation through a series of autoregressive transformations, we have the nice property that the log-likelihood of step t only depends on t-1; thus the Jacobian (the matrix of partial derivatives between t and t-1) is lower triangular, and its log-determinant can be computed as a sum:
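The sum itself is not reproduced in this post. As a sketch, following the standard inverse autoregressive flow formulation, it can be written as below. The symbols are assumptions rather than notation taken from the text: ε_i is the i-th element of the initial noise, σ_{t,i} is the scale applied to element i at step t, D is the dimension of z, and T is the number of transformation steps.

```latex
\log q(z_T \mid x) = -\sum_{i=1}^{D} \left( \tfrac{1}{2}\epsilon_i^2 + \tfrac{1}{2}\log(2\pi) + \sum_{t=0}^{T} \log \sigma_{t,i} \right)
```

Because each Jacobian is triangular, its log-determinant is just the sum of the logs of its diagonal entries, which is where the inner sum over log σ comes from.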

What kinds of transformations f could be used? Recall that after the reparameterization trick, z is a function of a noise element e and the mean and standard deviation output by the encoder Q:

If we apply successive layers of transformation, step t becomes the sum of the mean μ and the element-wise product of the prior layer z and the sigmoidal output σ:
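In symbols, the starting point and the successive step described above can be sketched as follows. The subscripts and the symbols μ, σ, and ε (the noise element written e in the text) are assumed notation, and ⊙ denotes the element-wise product.

```latex
z_0 = \mu_0 + \sigma_0 \odot \epsilon, \qquad z_t = \mu_t + \sigma_t \odot z_{t-1}
```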

In practice, we use a neural network transformation to stabilize the estimate of the mean at each step:
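The stabilized update is not shown in this post. A common form, assumed here from the inverse autoregressive flow literature, is the gated update z_t = σ_t ⊙ z_{t-1} + (1 − σ_t) ⊙ m_t, where m_t and the pre-sigmoid value s_t come from an autoregressive network that also receives h. The minimal sketch below uses a hypothetical hand-written function, `toy_autoregressive_net`, in place of a trained network; only its autoregressive structure matters for the illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def toy_autoregressive_net(z_prev, h):
    """Hypothetical stand-in for a trained autoregressive network.

    Output i depends only on z_prev[0..i-1] and on the context h,
    which is what makes the Jacobian of the step lower triangular.
    """
    m, s = [], []
    for i in range(len(z_prev)):
        context = sum(z_prev[:i]) + h[i]
        m.append(0.5 * context)
        s.append(1.0 + 0.1 * context)
    return m, s

def iaf_step(z_prev, h):
    """One gated update: z_t = sigma * z_prev + (1 - sigma) * m."""
    m, s = toy_autoregressive_net(z_prev, h)
    sigma = [sigmoid(v) for v in s]
    z_new = [sg * zp + (1.0 - sg) * mi
             for sg, zp, mi in zip(sigma, z_prev, m)]
    # The diagonal of the Jacobian is sigma, so its log-determinant is a sum.
    log_det = sum(math.log(sg) for sg in sigma)
    return z_new, sigma, log_det

z0 = [0.3, -1.2, 0.8]   # sample from the simple starting distribution
h = [0.1, 0.2, 0.3]     # "accessory data" from the encoder (made-up values)
z1, sigma1, log_det1 = iaf_step(z0, h)
print(z1, log_det1)
```

Stacking several such steps, each with its own network, gives the chain of transformations; the log-determinants simply add up across steps.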

Again, note the similarity of this transformation to the LSTM networks discussed in Chapter 3, Building Blocks of Deep Neural Networks. In Figure 5.8, there is another output (h) from the encoder Q in addition to the mean and standard deviation used to sample z. This output h is, in essence, "accessory data" that is passed into each successive transformation and, along with the weighted sum that is being calculated at each step, represents the "persistent memory" of the network in a way reminiscent of the LSTM.



Sunday, March 13, 2022

The reparameterization trick

 In order to allow us to backpropagate through our autoencoder, we need to transform the stochastic samples of z into a deterministic, differentiable transformation. We can do this by reparameterizing z as a function of a noise variable E:
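As a sketch in symbols, the usual Gaussian form of this reparameterization is shown below. Here ε is the noise variable written E in the text, μ and σ are assumed to be the mean and standard deviation output by the encoder Q, and g is the deterministic function referred to later in this section.

```latex
z = g(\epsilon; \mu, \sigma) = \mu(x) + \sigma(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
```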

Once we have sampled from E, the randomness in z no longer depends on the parameters of the variational distribution Q (the encoder), and we can backpropagate end to end. Our network now looks like Figure 5.7, and we can optimize our objective using random samples of E (for example, a standard normal distribution).

This reparameterization moves the "random" node out of the encoder/decoder framework so we can backpropagate through the whole system, but it also has a subtler advantage: it reduces the variance of these gradients. Note that in the un-reparameterized network, the distribution of z depends on the parameters of the encoder distribution Q; thus, as we are changing the parameters of Q, we are also changing the distribution of z, and we would potentially need to use a large number of samples to get a decent estimate.

By reparameterizing, z now depends only on our simpler function, g, with randomness introduced through sampling E from a standard normal (that doesn't depend on Q); hence, we've removed a somewhat circular dependency, and made the gradients we are estimating more stable:
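The variance claim can be checked on a toy problem. The sketch below is an illustration, not the VAE objective itself: it estimates the gradient of E[z^2] with respect to the mean of z ~ N(mu, sigma^2) in two ways, once with a score-function estimator that differentiates the log-density of the sampling distribution, and once with the reparameterized sample z = mu + sigma * eps. The objective `f` and the constants are assumptions chosen so that the true gradient, 2 * mu, is known.

```python
import random
import statistics

random.seed(0)

MU, SIGMA = 1.0, 1.0      # parameters of the sampling distribution
N = 20000                 # number of Monte Carlo samples

def f(z):
    return z * z          # toy objective; E[f(z)] = mu^2 + sigma^2

score_grads, reparam_grads = [], []
for _ in range(N):
    eps = random.gauss(0.0, 1.0)
    z = MU + SIGMA * eps
    # Score-function estimator: f(z) * d/dmu log N(z; mu, sigma^2)
    score_grads.append(f(z) * (z - MU) / SIGMA ** 2)
    # Reparameterized estimator: d/dmu f(mu + sigma * eps) = 2 * z
    reparam_grads.append(2.0 * z)

true_grad = 2.0 * MU
print("true gradient:", true_grad)
print("score-function mean/variance:",
      statistics.mean(score_grads), statistics.variance(score_grads))
print("reparameterized mean/variance:",
      statistics.mean(reparam_grads), statistics.variance(reparam_grads))
```

Both estimators are unbiased, but for this toy case the per-sample variances work out analytically to 30 for the score-function estimator and 4 for the reparameterized one, so far fewer samples are needed for a stable estimate.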

Now that you have seen how the VAE network is constructed, let's discuss a further refinement of this algorithm that allows VAEs to sample from complex distributions: Inverse Autoregressive Flow (IAF).

The variational objective

We previously covered several examples of how images can be compressed into numerical vectors using neural networks. This section will introduce the elements that allow us to create effective encodings to sample new images from a space of random numerical vectors, which are principally efficient inference algorithms and appropriate objective functions. Let's start by quantifying more rigorously what makes such an encoding "good" and allows us to recreate images well. We will need to maximize the posterior:

p(z|x) = p(x|z)p(z)/p(x)

A problem occurs when x is extremely high dimensional, which, as you saw, can occur in even simple data such as binary MNIST digits, where we have 2^(number of pixels) possible configurations that we would need to integrate over (in a mathematical sense of integrating over a probability distribution) to get a measure of the probability of an individual image; in other words, the density p(x) is intractable, making the posterior p(z|x), which depends on p(x), likewise intractable.
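To get a feel for the size of that number, the short calculation below counts the configurations of a binary image, assuming the 28 by 28 pixel MNIST image size.

```python
num_pixels = 28 * 28                     # one binary MNIST image
num_configurations = 2 ** num_pixels     # every on/off pattern of the pixels
num_digits = len(str(num_configurations))
print(num_pixels, num_digits)
```

The count has 237 decimal digits, so summing over every configuration to normalize p(x) is out of reach even for this small image size.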

In some cases, such as you saw in Chapter 4, Teaching Networks to Generate Digits, we can use simple cases such as binary units to compute an approximation such as contrastive divergence, which allows us to still compute a gradient even if we can't calculate a closed form. However, this might also be challenging for very large datasets, where we would need to make many passes over the data to compute an average gradient using Contrastive Divergence (CD), as you saw previously in Chapter 4, Teaching Networks to Generate Digits.

If we can't calculate the distribution of our encoder p(z|x) directly, maybe we could optimize an approximation that is "close enough"; let's call this q(z|x). Then, we could use a measure to determine if the distributions are close enough. One useful measure of closeness is whether the two distributions encode similar information; we can quantify information using the Shannon Information equation:

I(p(x)) = -log(p(x))

Consider why this is a good measure: as p(x) decreases, an event becomes rarer, and thus observation of the event communicates more information about the system or dataset, leading to a positive value of -log(p(x)). Conversely, as the probability of an event nears 1, that event encodes less information about the dataset, and the value of -log(p(x)) approaches 0.

Thus, if we wanted to measure the difference between the information encoded in two distributions, p and q, we could use the difference in their information:

I(p(x)) - I(q(x)) = -log(p(x)) + log(q(x)) = log(q(x)/p(x))

Finally, if we want to find the expected difference in information between the distributions for all elements of x, we can take the average over p(x):
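The averaged quantity is not shown in this post. In the standard convention (an assumption about the notation intended here), it is written as follows.

```latex
\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right] = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
```

Note the ordering inside the logarithm: this is the negative of the expected value of log(q(x)/p(x)), that is, the expected value of -log(q(x)) + log(p(x)) under p, which is the ordering that makes the result non-negative.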

This quantity is known as the Kullback-Leibler (KL) Divergence. It has a few interesting properties:

1. It is not symmetric: KL(p(x), q(x)) does not, in general, equal KL(q(x), p(x)), so the "closeness" is measured by mapping one distribution to another in a particular direction.

2. Whenever q(x) and p(x) match, the term is 0, meaning they are a minimum distance from one another. Likewise, KL(p(x), q(x)) is 0 only if p and q are identical.

3. If q(x) is 0 where p(x) is not, then KL is undefined (infinite); by definition, it only computes relative information over the range of x where the two distributions overlap.

4. KL is always greater than or equal to 0.
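These properties are easy to check numerically for small discrete distributions. The sketch below is a minimal illustration; the function name `kl_divergence` and the example distributions are made up for this purpose.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Terms with p_i == 0 contribute 0 by convention; if q_i == 0 where
    p_i > 0, the divergence is infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total

p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

print("KL(p, q):", kl_divergence(p, q))   # differs from KL(q, p): not symmetric
print("KL(q, p):", kl_divergence(q, p))
print("KL(p, p):", kl_divergence(p, p))   # 0 when the distributions match
print("KL with missing support:", kl_divergence([0.5, 0.5], [1.0, 0.0]))
```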

If we were to use the KL divergence to compute how good an approximation q(z|x) is of our intractable p(z|x), we could write:
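The expression is missing from this post; in the standard convention it would read as follows, with the expectation taken over samples of z drawn from q(z|x).

```latex
\mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q(z \mid x)}\big[\log q(z \mid x) - \log p(z \mid x)\big]
```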

Now we can write an expression for our intractable p(x) as well: since log(p(x)) does not depend on z, its expectation with respect to q(z|x) is simply log(p(x)). Thus, we can represent the objective of the VAE, learning the marginal distribution of p(x), using the KL divergence:
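The resulting expression is not reproduced in this post. A sketch of the standard derivation, substituting Bayes' rule p(z|x) = p(x|z)p(z)/p(x) into the KL divergence between q(z|x) and p(z|x) and rearranging, is:

```latex
\begin{aligned}
\mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big)
  &= \mathbb{E}_{q(z \mid x)}\big[\log q(z \mid x) - \log p(x \mid z)p(z)\big] + \log p(x) \\
\log p(x)
  &= \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big)
   + \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)p(z) - \log q(z \mid x)\big]
\end{aligned}
```

The second term on the right-hand side of the last line is the lower bound on log p(x).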

The second term is also known as the Variational Lower Bound, which is also referred to as the Evidence Lower Bound (ELBO); since KL(q,p) is greater than or equal to 0, log(p(x)) is greater than or (if KL(q,p) is 0) equal to this value.

To explain what this objective is doing, notice that the expectation introduces a difference between q(z|x) (encoding x) and p(x|z)p(z) (the joint probability of the data and the encoding); thus we want to minimize what is essentially the gap between the probability of the encoding and the joint probability of the encoding and data (equivalently, to maximize the lower bound), with an error term given by KL(q,p), the difference between a tractable approximation and the intractable form of the encoder p(z|x). We can imagine the functions Q(z|x) and P(x|z) being represented by two deep neural networks; one generates the latent code z (Q), and the other reconstructs x from this code (P). We can imagine this as an autoencoder setup, as above with the stacked RBM models, with an encoder and decoder:

We want to optimize the parameters of the encoder Q and the decoder P to minimize the reconstruction cost. One way to do this is to construct Monte Carlo samples to optimize the parameters of Q using gradient descent:

However, it has been found in practice that a large number of samples may be required in order for the variance of these gradient updates to stabilize.
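A Monte Carlo estimate of the lower bound can be sketched on a toy model where everything is known in closed form. The model below is an assumption chosen for illustration, not the image model from this chapter: the prior is p(z) = N(0, 1) and the likelihood is p(x|z) = N(z, 1), so that p(x) = N(0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).

```python
import math
import random

random.seed(0)

def log_normal(v, mean, var):
    """Log-density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)

def elbo_estimate(x, q_mean, q_var, num_samples):
    """Average of log p(x|z)p(z) - log q(z|x) over samples z ~ q(z|x)."""
    total = 0.0
    for _ in range(num_samples):
        z = random.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
        total += log_joint - log_normal(z, q_mean, q_var)
    return total / num_samples

x = 1.5
log_px = log_normal(x, 0.0, 2.0)                      # exact log p(x)
elbo_exact_q = elbo_estimate(x, x / 2.0, 0.5, 1000)   # q equals the true posterior
elbo_poor_q = elbo_estimate(x, -1.0, 2.0, 1000)       # a deliberately poor q

print("log p(x):            ", log_px)
print("ELBO, q = posterior: ", elbo_exact_q)
print("ELBO, poor q:        ", elbo_poor_q)
```

When q matches the true posterior, the KL term vanishes and the estimate coincides with log p(x); with a poor q, the estimate falls below log p(x) by roughly the KL divergence between q and the posterior.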

We also have a practical problem here: even if we could choose enough samples to get a good approximation of the gradients for the encoder, our network contains a stochastic, non-differentiable step (sampling z) that we can't backpropagate through, in a similar way to how we couldn't backpropagate through the stochastic units in the RBM in Chapter 4, Teaching Networks to Generate Digits. Thus, our reconstruction error depends on samples from z, but we can't backpropagate through the step that generates these samples to tune the network end to end. Is there a way we can create a differentiable decoder/encoder architecture while also reducing the variance of sample estimates? One of the main insights of the VAE is to enable this through the "reparameterization trick."



Saturday, March 12, 2022

Creating separable encodings of images

 In Figure 5.1, you can see an example of images from the CIFAR-10 dataset, along with an example of an early VAE algorithm that can generate fuzzy versions of these images based on a random number input:

More recent work on VAE networks has allowed these models to generate much better images, as you will see later in this chapter. To start, let's revisit the problem of generating MNIST digits and how we can extend this approach to more complex data.

Recall from Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, and Chapter 4, Teaching Networks to Generate Digits, that the RBM (or DBN) model in essence involves learning the posterior probability distribution for images (x) given some latent "code" (z), represented by the hidden layer(s) of the network, the "marginal likelihood" of x:

We can see z as being an "encoding" of the image x (for example, the activations of the binary hidden units in the RBM), which can be decoded (for example, run the RBM in reverse in order to sample an image) to get a reconstruction of x. If the encoding is "good," the reconstruction will be close to the original image. Because these networks encode and decode representations of their input data, they are also known as "autoencoders."

The ability of deep neural networks to capture the underlying structure of complex data is one of their most attractive features; as we saw with the DBN model in Chapter 4, Teaching Networks to Generate Digits, it allows us to improve the performance of a classifier by creating a better underlying model for the distribution of the data. It can also be used to simply create a better way to "compress" the complexity of data, in a similar way to principal component analysis (PCA) in classical statistics. In Figure 5.2, you can see how the stacked RBM model can be used as a way to encode the distribution of faces, for example.

We start with a "pre-training" phase to create a 30-unit encoding vector, which we then calibrate by forcing it to reconstruct the input image, before fine-tuning with standard backpropagation:

As an example of how the stacked RBM model can more effectively represent the distribution of images, the authors of the paper Reducing the Dimensionality of Data with Neural Networks, from which Figure 5.2 is derived, demonstrated using a two-unit code for the MNIST digits compared to PCA:

On the left, we see the digits 0-9 (represented by different shades and shapes) encoded using 2-dimensional PCA. Recall that PCA is generated using a low-dimensional factorization of the covariance matrix of the data:

Where Cov(X) is the same height/width M as the data (for example, 28 by 28 pixels in MNIST) and U and V are both lower dimensional (M * k and k * M), where k is much smaller than M. Because they have a smaller number of rows/columns k than the original data in one dimension, U and V are lower-dimensional representations of the data, and we can get an encoding of an individual image by projecting it onto these k vectors, giving a k-unit encoding of the data. Since the decomposition (and projection) is a linear transformation (multiplying two matrices), the ability of the PCA components to distinguish data well depends on the data being linearly separable (we can draw a hyperplane through the space between groups; that space could be two-dimensional or N-dimensional, like the 784 pixels in the MNIST images).
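The projection step can be sketched on a tiny example. The code below is a minimal illustration with k = 1 on made-up two-dimensional points; it finds the leading direction of the covariance matrix by power iteration, a technique swapped in here to keep the example dependency-free, whereas library routines would normally use an eigendecomposition or SVD. The function name `first_principal_component` is hypothetical.

```python
import math

def first_principal_component(points, iterations=100):
    """Return the leading covariance direction and the 1-unit code of each point."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data (dim x dim).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Encoding: project each centered point onto the component.
    codes = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
    return v, codes

# Points lying on the line y = 2x, so one component captures all the variance.
points = [(t, 2.0 * t) for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
component, codes = first_principal_component(points)
print(component)
print(codes)
```

Each two-dimensional point is reduced to a single number, its coordinate along the component; because the mapping is linear, it can only separate groups that a hyperplane can separate.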

As you can see in Figure 5.3, PCA yields overlapping codes for the images, showing that it is challenging to represent digits using a two-component linear decomposition, in which vectors representing the same digit are close together, while those representing different digits are clearly separated. Conceptually, the neural network is able to capture more of the variation between images representing different digits than PCA, as shown by its ability to separate the representations of these digits more clearly in a two-dimensional space.

As an analogy to understand this phenomenon, consider a very simple two-dimensional dataset consisting of parallel hyperbolas (second-power polynomials):

At the top, even though we have two distinct classes, we cannot draw a straight line through two-dimensional space to separate the two groups; in a neural network, the weight matrix in a single layer before the nonlinear transformation of a sigmoid or tanh is, in essence, a linear boundary of this kind. However, if we apply a nonlinear transformation to our 2D coordinates, such as taking the square root of the hyperbolas, we can create two separable planes (Figure 5.4, bottom).

A similar phenomenon is at play with our MNIST data: we need a neural network in order to place these 784-pixel images into distinct, separable regions of space.

This goal is achieved by performing a non-linear transformation on the original, overlapping data, with an objective function that rewards increasing the spatial separation among vectors encoding the images of different digits. A separable representation thus increases the ability of the neural network to differentiate image classes using these representations. Thus, in Figure 5.3, we can see on the right that applying the DBN model creates the required non-linear transformation to separate the different images.

Now that we've covered how neural networks can compress data into numerical vectors and what some desirable properties of those vector representations are, we'll examine how to optimally compress information in these vectors. To do so, each element of the vector should encode distinct information from the others, a property we can achieve using a variational objective. This variational objective is the building block for creating VAE networks.