
Thursday, March 17, 2022

Inverse Autoregressive Flow

In our discussion earlier, it was noted that we want to use q(z|x) as a way to approximate the "true" p(z|x) that would allow us to generate an ideal encoding of the data, and thus sample from it to generate new images. So far, we've assumed that q(z|x) has a relatively simple distribution, such as a vector of independent Gaussian random variables (a diagonal covariance matrix with 0s on the non-diagonal elements). This sort of distribution has many benefits; because it is simple, we have an easy way to generate new samples by drawing from random normal distributions, and because it is independent, we can separately tune each element of the latent vector z to influence parts of the output image.

However, such a simple distribution may not fit the desired output distribution of data well, increasing the KL divergence between p(z|x) and q(z|x). Is there a way we can keep the desirable properties of q(z|x) but "transform" z so that it captures more of the complexities needed to represent x?

One approach is to apply a series of autoregressive transformations to z to turn it from a simple into a complex distribution; by "autoregressive," we mean that each transformation utilizes both data from the previous transformation and the current data to compute an updated version of z. In contrast, the basic form of VAE that we introduced above has only a single "transformation" from z to the output (though z might pass through multiple layers, there is no recursive network link to further refine that output). We've seen such transformations before, such as the LSTM networks in Chapter 3, Building Blocks of Deep Neural Networks, where the output of the network is a combination of the current input and a weighted version of the prior time step.

An attractive property of the independent q(z|x) distributions we discussed earlier, such as independent normals, is that they have a very tractable expression for the log-likelihood. This property is important for the VAE model because its objective function depends on integrating over the whole likelihood function, which would be cumbersome for more complex log-likelihood functions. However, by constraining the transformed z to be computed through a series of autoregressive transformations, we get the nice property that the output at step t only depends on step t-1; the Jacobian (the matrix of partial derivatives of step t with respect to step t-1) is therefore lower triangular, and its log-determinant can be computed as a sum:
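$$\log\left|\det\frac{\partial z_t}{\partial z_{t-1}}\right| = \sum_{i=1}^{D}\log\frac{\partial z_{t,i}}{\partial z_{t-1,i}}$$

(the determinant of a lower triangular matrix is the product of its diagonal entries, so its log-determinant reduces to a sum over the D latent dimensions).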

What kinds of transformations f could be used? Recall that after the reparameterization trick, z is a function of a noise element e and the mean and standard deviation output by the encoder Q:

If we apply successive layers of transformation, step t becomes the sum of u and the element-wise product of the prior layer's z and the sigmoidal output:
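$$z_t = u_t + \sigma_t \odot z_{t-1}$$

where $\odot$ denotes element-wise multiplication and $\sigma_t$ is a sigmoid-transformed scale produced at step t (written here in the notation of the IAF paper).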

In practice, we use a neural network transformation to stabilize the estimate of the mean at each step:
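In the numerically stable form proposed in the IAF paper (Kingma et al., 2016), an autoregressive network outputs both a shift m and a pre-sigmoid scale s at each step:

$$[m_t, s_t] = \text{AutoregressiveNN}_t(z_{t-1}, h), \qquad \sigma_t = \text{sigmoid}(s_t), \qquad z_t = \sigma_t \odot z_{t-1} + (1 - \sigma_t) \odot m_t$$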

Again, note the similarity of this transformation to the LSTM networks discussed in Chapter 3, Building Blocks of Deep Neural Networks. In Figure 5.8, there is another output (h) from the encoder Q, in addition to the mean and standard deviation used to sample z. h is, in essence, "accessory data" that is passed into each successive transformation and, along with the weighted sum that is being calculated at each step, represents the "persistent memory" of the network in a way reminiscent of the LSTM.
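To make this concrete, here is a minimal TensorFlow sketch of a single IAF step conditioned on the accessory data h. The class and layer names are illustrative only, and the two Dense layers stand in for a properly masked (MADE-style) autoregressive network, which a real IAF needs so that the Jacobian stays lower triangular:

import tensorflow as tf

class IAFStep(tf.keras.layers.Layer):
    """One inverse autoregressive flow step (illustrative sketch)."""
    def __init__(self, latent_dim):
        super().__init__()
        # Placeholders: a real IAF uses masked autoregressive layers here
        self.m_net = tf.keras.layers.Dense(latent_dim)
        self.s_net = tf.keras.layers.Dense(latent_dim)

    def call(self, z_prev, h):
        inputs = tf.concat([z_prev, h], axis=-1)   # condition on accessory data h
        m = self.m_net(inputs)
        sigma = tf.nn.sigmoid(self.s_net(inputs))
        z = sigma * z_prev + (1.0 - sigma) * m     # gated update of the latent code
        # Lower triangular Jacobian => log-determinant is a simple sum
        log_det = tf.reduce_sum(tf.math.log(sigma + 1e-6), axis=-1)
        return z, log_det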



Sunday, March 13, 2022

The reparameterization trick

In order to allow us to backpropagate through our autoencoder, we need to replace the stochastic sampling of z with a deterministic, differentiable transformation. We can do this by reparameterizing z as a function of a noise variable E:
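$$z = g(E, x) = \mu(x) + \sigma(x) \odot E, \qquad E \sim \mathcal{N}(0, I)$$

where $\mu$ and $\sigma$ are the mean and standard deviation output by the encoder Q (the usual Gaussian form of the trick).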

Once we have sampled from E, the randomness in z no longer depends on the parameters of the variational distribution Q (the encoder), and we can backpropagate end to end. Our network now looks like Figure 5.7, and we can optimize our objective using random samples of E (for example, from a standard normal distribution).

This reparameterization moves the "random" node outside of the encoder/decoder framework so we can backpropagate through the whole system, but it also has a subtler advantage: it reduces the variance of these gradients. Note that in the un-reparameterized network, the distribution of z depends on the parameters of the encoder distribution Q; thus, as we change the parameters of Q, we also change the distribution of z, and we would potentially need to use a large number of samples to get a decent estimate.

By reparameterizing, z now depends only on our simpler function, g, with randomness introduced through sampling E from a standard normal (that doesn't depend on Q); hence, we've removed a somewhat circular dependency and made the gradients we are estimating more stable:
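A minimal sketch of this in TensorFlow, assuming the encoder outputs a mean and a log-variance (the function name here is illustrative, not the book's):

import tensorflow as tf

def reparameterize(mu, log_var):
    # Randomness lives only in eps, which does not depend on the encoder's parameters
    eps = tf.random.normal(shape=tf.shape(mu))
    # Deterministic, differentiable transformation of mu and log_var
    return mu + tf.exp(0.5 * log_var) * eps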

Now that you have seen how the VAE network is constructed, let's discuss a further refinement of this algorithm that allows VAEs to sample from complex distributions:

Inverse Autoregressive Flow (IAF).

The variational objective

We previously covered several examples of how images can be compressed into numerical vectors using neural networks. This section will introduce the elements that allow us to create effective encodings to sample new images from a space of random numerical vectors, principally efficient inference algorithms and appropriate objective functions. Let's start by quantifying more rigorously what makes such an encoding "good" and allows us to recreate images well. We will need to maximize the posterior:

p(z|x) = p(x|z)p(z)/p(x)

A problem occurs because the space of x is extremely high dimensional, which, as you saw, can occur in even simple data such as binary MNIST digits, where we have 2^(number of pixels) possible configurations that we would need to integrate over (in the mathematical sense of integrating over a probability distribution) to get a measure of the probability of an individual image; in other words, the density p(x) is intractable, making the posterior p(z|x), which depends on p(x), likewise intractable.

In some cases, as you saw in Chapter 4, Teaching Networks to Generate Digits, we can use simple cases such as binary units to compute an approximation such as contrastive divergence, which allows us to still compute a gradient even if we can't calculate a closed form. However, this might also be challenging for very large datasets, where we would need to make many passes over the data to compute an average gradient using Contrastive Divergence (CD), as you saw previously in Chapter 4, Teaching Networks to Generate Digits.

If we can't calculate the distribution of our encoder p(z|x) directly, maybe we could optimize an approximation that is "close enough" - let's call this q(z|x). Then, we could use a measure to determine if the distributions are close enough. One useful measure of closeness is whether the two distributions encode similar information; we can quantify information using the Shannon information equation:

I(p(x)) = -log(p(x))

Consider why this is a good measure: as p(x) decreases, an event becomes rarer, and thus observation of the event communicates more information about the system or dataset, leading to a large positive value of -log(p(x)). Conversely, as the probability of an event nears 1, that event encodes less information about the dataset, and the value of -log(p(x)) approaches 0.

Thus, if we wanted to measure the difference between the information encoded in two distributions, p and q, we could use the difference in their information:

I(q(x)) - I(p(x)) = -log(q(x)) + log(p(x)) = log(p(x)/q(x))

Finally, if we want to find the expected difference in information between the distributions for all elements of x, we can take the average over p(x):
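$$KL\big(p(x)\,\|\,q(x)\big) = \mathbb{E}_{p(x)}\left[\log\frac{p(x)}{q(x)}\right] = \int p(x)\log\frac{p(x)}{q(x)}\,dx$$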

This quantity is known as the Kullback-Leibler (KL) divergence. It has a few interesting properties:

1. It is not symmetric: KL(p(x), q(x)) does not, in general, equal KL(q(x), p(x)), so the "closeness" is measured by mapping one distribution to another in a particular direction.

2. Whenever q(x) and p(x) match, the term is 0, meaning they are a minimum distance from one another. Likewise, KL(p(x), q(x)) is 0 only if p and q are identical.

3. If q(x) is 0 or p(x) is 0, then KL is undefined; by definition, it only computes relative information over the range of x where the two distributions overlap.

4. KL is always greater than or equal to 0.

If we were to use the KL divergence to compute how well an approximation q(z|x) is of our intractable p(z|x), we could write:
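$$KL\big(q(z|x)\,\|\,p(z|x)\big) = \mathbb{E}_{q(z|x)}\big[\log q(z|x) - \log p(z|x)\big]$$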

Now we can write an expression for our intractable p(x) as well: since log(p(x)) does not depend on q(z|x), the expectation of log(p(x)) with respect to q(z|x) is simply log(p(x)). Thus, we can represent the objective of the VAE, learning the marginal distribution of p(x), using the KL divergence:
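$$\log p(x) = KL\big(q(z|x)\,\|\,p(z|x)\big) + \mathbb{E}_{q(z|x)}\big[\log p(x, z) - \log q(z|x)\big]$$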

The second term is also known as the Variational Lower Bound, which is also referred to as the Evidence Lower Bound (ELBO); since KL(q, p) is always greater than or equal to 0, log(p(x)) is greater than or (if KL(q, p) is 0) equal to this value.

To explain what this objective is doing, notice that the expectation introduces a difference between q(z|x) (the encoding of x) and p(x|z)p(z) (the joint probability of the data and the encoding); thus we want to maximize a lower bound that is essentially the gap between the probability of the encoding and the joint probability of the encoding and data, with an error term given by KL(q, p), the difference between a tractable approximation and the intractable form of the encoder p(z|x). We can imagine the functions Q(z|x) and P(x|z) being represented by two deep neural networks; one generates the latent code z (Q), and the other reconstructs x from this code (P). We can imagine this as an autoencoder setup, as above with the stacked RBM models, with an encoder and decoder:

We want to optimize the parameters of the encoder Q and the decoder P to minimize the reconstruction cost. One way to do this is to construct Monte Carlo samples to optimize the parameters of Q using gradient descent:

However, it has been found in practice that a large number of samples may be required in order for the variance of these gradient updates to stabilize.

We also have a practical problem here: even if we could choose enough samples to get a good approximation of the gradients for the encoder, our network contains a stochastic, non-differentiable step (sampling z) that we can't backpropagate through, in a similar way to how we couldn't backpropagate through the stochastic units in the RBM in Chapter 4, Teaching Networks to Generate Digits. Thus, our reconstruction error depends on samples from z, but we can't backpropagate through the step that generates these samples to tune the network end to end. Is there a way we can create a differentiable decoder/encoder architecture while also reducing the variance of sample estimates? One of the main insights of the VAE is to enable this through the "reparameterization trick."



Saturday, March 12, 2022

Creating separable encodings of images

 In Figure 5.1, you can see an example of images from the CIFAR-10 dataset, along with an example of an early VAE algorithm that can generate fuzzy versions of these images based on a random number input:

More recent work on VAE networks has allowed these models to generate much better images, as you will see later in this chapter. To start, let's revisit the problem of generating MNIST digits and how we can extend this approach to more complex data.

Recall from Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, and Chapter 4, Teaching Networks to Generate Digits, that the RBM (or DBN) model in essence involves learning the posterior probability distribution for images (x) given some latent "code" (z), represented by the hidden layer(s) of the network, and the "marginal likelihood" of x:
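$$p(x) = \int p(x|z)\,p(z)\,dz$$

that is, the probability of an image obtained by integrating (or summing) over all configurations of the latent code.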

We can see z as being an "encoding" of the image x (for example, the activations of the binary hidden units in the RBM), which can be decoded (for example, by running the RBM in reverse in order to sample an image) to get a reconstruction of x. If the encoding is "good," the reconstruction will be close to the original image. Because these networks encode and decode representations of their input data, they are also known as "autoencoders."

The ability of deep neural networks to capture the underlying structure of complex data is one of their most attractive features; as we saw with the DBN model in Chapter 4, Teaching Networks to Generate Digits, it allows us to improve the performance of a classifier by creating a better underlying model for the distribution of the data. It can also be used to simply create a better way to "compress" the complexity of data, in a similar way to principal component analysis (PCA) in classical statistics. In Figure 5.2, you can see how the stacked RBM model can be used as a way to encode the distribution of faces, for example.

We start with a "pre-training" phase to create a 30-unit encoding vector, which we then calibrate by forcing it to reconstruct the input image, before fine-tuning with standard backpropagation:

As an example of how the stacked RBM model can more effectively represent the distribution of images, the authors of the paper Reducing the Dimensionality of Data with Neural Networks, from which Figure 5.2 is derived, demonstrated using a two-unit code for the MNIST digits compared to PCA:

On the left, we see the digits 0-9 (represented by different shades and shapes) encoded using 2-dimensional PCA. Recall that PCA is generated using a low-dimensional factorization of the covariance matrix of the data:
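$$\mathrm{Cov}(X) \approx U\,V, \qquad U \in \mathbb{R}^{M \times k},\ V \in \mathbb{R}^{k \times M}$$

(a low-rank factorization, sketched here using the dimensions described below).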

where Cov(X) has the same height/width M as the data (for example, 28 by 28 pixels in MNIST) and U and V are both lower dimensional (M x k and k x M), where k is much smaller than M. Because they have a smaller number of rows/columns k than the original data in one dimension, U and V are lower-dimensional representations of the data, and we can get an encoding of an individual image by projecting it onto these k vectors, giving a k-unit encoding of the data. Since the decomposition (and projection) is a linear transformation (multiplying two matrices), the ability of the PCA components to distinguish data well depends on the data being linearly separable (that is, we can draw a hyperplane through the space between groups; that space could be two-dimensional or N-dimensional, like the 784 pixels in the MNIST images).

As you can see in Figure 5.3, PCA yields overlapping codes for the images, showing that it is challenging to represent digits using a two-component linear decomposition in which vectors representing the same digit are close together, while those representing different digits are clearly separated. Conceptually, the neural network is able to capture more of the variation between images representing different digits than PCA, as shown by its ability to separate the representations of these digits more clearly in a two-dimensional space.
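As a rough illustration of this linear baseline (not the book's code; it assumes scikit-learn is available), a two-component PCA encoding of MNIST can be computed as follows:

from sklearn.decomposition import PCA
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()
x_flat = x_train.reshape(len(x_train), -1) / 255.0  # flatten 28x28 images into 784-vectors

pca = PCA(n_components=2)          # a linear, two-unit "encoding"
codes = pca.fit_transform(x_flat)  # shape (60000, 2)
# Plotting `codes` colored by `y_train` shows the overlapping clusters
# described for the PCA panel of Figure 5.3.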

As an analogy to understand this phenomenon, consider a very simple two-dimensional dataset consisting of parallel hyperbolas (degree-2 polynomials).

At the top, even though we have two distinct classes, we cannot draw a straight line through two-dimensional space to separate the two groups; in a neural network, the weight matrix in a single layer before the nonlinear transformation of a sigmoid or tanh is, in essence, a linear boundary of this kind. However, if we apply a nonlinear transformation to our 2D coordinates, such as taking the square root of the hyperbolas, we can create two separable planes (Figure 5.4, bottom).

A similar phenomenon is at play with our MNIST data: we need a neural network in order to place these 784-pixel digit images into distinct, separable regions of space.

This goal is achieved by performing a non-linear transformation on the original, overlapping data, with an objective function that rewards increasing the spatial separation among vectors encoding the images of different digits. A separable representation thus increases the ability of the neural network to differentiate image classes using these representations. Thus, in Figure 5.3, we can see on the right that applying the DBN model creates the required non-linear transformation to separate the different images.

Now that we've covered how neural networks can compress data into numerical vectors and what some desirable properties of those vector representations are, we'll examine how to optimally compress information in these vectors. To do so, each element of the vector should encode distinct information from the others, a property we can achieve using a variational objective. This variational objective is the building block for creating VAE networks.


5. Painting Pictures with Neural Networks Using VAEs

As you saw in Chapter 4, Teaching Networks to Generate Digits, deep neural networks are a powerful tool for creating generative models for complex data such as images, allowing us to develop a network that can generate images from the MNIST hand-drawn digit database. In that example, the data is relatively simple; images can only come from a limited set of categories (the digits 0 through 9) and are low-resolution grayscale data.

What about more complex data, such as color images drawn from the real world? One example of such "real world" data is the Canadian Institute for Advanced Research 10-class dataset, denoted as CIFAR-10. It is a subset of 60,000 examples from a larger set of 80 million images, divided into ten classes - airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. While still an extremely limited set in terms of the diversity of images we would encounter in the real world, these classes have some characteristics that make them more complex than MNIST. For example, the MNIST digits can vary in width, curvature, and a few other properties; the CIFAR-10 classes have a much wider potential range of variation for animal or vehicle photos, meaning we may require more complex models in order to capture this variation.

In this chapter, we will discuss a class of generative models known as Variational Autoencoders (VAEs), which are designed to make the generation of these complex, real-world images more tractable and tunable. They do this by using a number of clever simplifications to make it possible to sample over the complex probability distribution represented by real-world images in a way that is scalable.

We will explore the following topics to reveal how VAEs work:

- How neural networks create low-dimensional representations of data, and some desirable properties of those representations

- How variational methods allow us to sample from complex data using these representations

- How using the reparameterization trick allows us to stabilize the variance of a neural network based on variational sampling - a VAE

- How we can use Inverse Autoregressive Flow (IAF) to tune the output of a VAE

- How to implement VAE/IAF in TensorFlow


Summary

In this chapter, you learned about one of the most important models from the beginnings of the deep learning revolution, the DBN. You saw that DBNs are constructed by stacking together RBMs, and how these undirected models can be trained using CD.

This chapter then described a greedy, layer-wise procedure for priming a DBN by sequentially training each of a stack of RBMs, which can then be fine-tuned using the wake-sleep algorithm or backpropagation. We then explored practical examples of using the TensorFlow 2 API to create an RBM layer and a DBN model, illustrating the use of the GradientTape class to compute updates using CD.

You also learned how, following the wake-sleep algorithm, we can compile the DBN as a normal deep neural network and perform backpropagation for supervised training. We applied these models to MNIST data and saw how an RBM can generate digits after training converges, and that it has features resembling the convolutional filters described in Chapter 3, Building Blocks of Deep Neural Networks.

While the examples in this chapter involved significantly extending the basic layer and model classes of the TensorFlow Keras API, they should give you an idea of how to implement your own low-level alternative training procedures. Going forward, we will mostly stick to using the standard fit() and predict() methods, starting with our next topic, Variational Autoencoders, a sophisticated and computationally efficient way to generate image data.


Creating a DBN with the Keras Model API

You have now seen how to create a single-layer RBM to generate images; this is the building block required to create a full-fledged DBN. Usually, for a model in TensorFlow 2, we only need to extend tf.keras.Model and define an initialization (where the layers are defined) and a call function (for the forward pass). For our DBN model, we also need a few more custom functions to define its behavior.

First, in the initialization, we need to pass a list of dictionaries that contain the parameters for our RBM layers (number_hidden_units, number_visible_units, learning_rate, cd_steps):

class DBN(tf.keras.Model):
    def __init__(self, rbm_params=None, name='deep_belief_network',
                 num_epochs=100, tolerance=1e-3, batch_size=32,
                 shuffle_buffer=1024, **kwargs):
        super().__init__(name=name, **kwargs)
        self._rbm_params = rbm_params
        self._rbm_layers = list()
        self._dense_layers = list()
        for num, rbm_param in enumerate(rbm_params):
            self._rbm_layers.append(RBM(**rbm_param))
            self._rbm_layers[-1].build([rbm_param["number_visible_units"]])
            if num < len(rbm_params)-1:
                self._dense_layers.append(
                    tf.keras.layers.Dense(rbm_param["number_hidden_units"],
                                          activation=tf.nn.sigmoid))
            else:
                self._dense_layers.append(
                    tf.keras.layers.Dense(rbm_param["number_hidden_units"],
                                          activation=tf.nn.softmax))
            self._dense_layers[-1].build([rbm_param["number_visible_units"]])
        self._num_epochs = num_epochs
        self._tolerance = tolerance
        self._batch_size = batch_size
        self._shuffle_buffer = shuffle_buffer
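As a quick illustration (the layer sizes and learning rate here are hypothetical, not the book's exact values, and the RBM class from the earlier section is assumed to be defined), a two-layer DBN for flattened 28x28 MNIST images could be constructed like this:

rbm_params = [
    {"number_visible_units": 784, "number_hidden_units": 500,
     "learning_rate": 0.05, "cd_steps": 1},
    {"number_visible_units": 500, "number_hidden_units": 500,
     "learning_rate": 0.05, "cd_steps": 1},
]
dbn = DBN(rbm_params=rbm_params, num_epochs=100, batch_size=32)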

Note at the same time that we also initialize a set of sigmoidal dense layers with a softmax at the end, which we can use for fine-tuning through backpropagation once we've trained the model using the generative procedures outlined earlier. To train the DBN, we begin a new code block to start the generative learning process for the stack of RBMs:

# pretraining: greedily train each RBM on the (transformed) data
inputs_layers = []
for num in range(len(self._rbm_layers)):
    if num == 0:
        inputs_layers.append(inputs)
        self._rbm_layers[num] = self.train_rbm(
            self._rbm_layers[num], inputs,
            num_epochs=self._num_epochs, tolerance=self._tolerance,
            batch_size=self._batch_size, shuffle_buffer=self._shuffle_buffer)
    else:  # pass all data through the previous layer
        inputs_layers.append(
            inputs_layers[num-1].map(self._rbm_layers[num-1].forward))
        self._rbm_layers[num] = self.train_rbm(
            self._rbm_layers[num], inputs_layers[num],
            num_epochs=self._num_epochs, tolerance=self._tolerance,
            batch_size=self._batch_size, shuffle_buffer=self._shuffle_buffer)

Notice that for computational efficiency, we generate the input for each layer past the first by passing every datapoint through the prior layer in a forward pass using the map() function of the Dataset API, instead of having to generate these forward samples repeatedly. While this takes more memory, it greatly reduces the computation required. Each layer in the pre-training loop calls back to the CD loop you saw before, which is now a member function of the DBN class:

def train_rbm(self, rbm, inputs, num_epochs, tolerance, batch_size, shuffle_buffer):
    last_cost = None
    for epoch in range(num_epochs):
        cost = 0.0
        count = 0.0
        for datapoints in inputs.shuffle(shuffle_buffer).batch(batch_size).take(1):
            cost += rbm.cd_update(datapoints)
            count += 1.0
        cost /= count
        print("epoch: {}, cost: {}".format(epoch, cost))
        # stop early once the cost change falls below the tolerance
        if last_cost and abs(last_cost - cost) <= tolerance:
            break
        last_cost = cost
    return rbm

Once we have pre-trained in a greedy manner, we can proceed to the wake-sleep step. We start with the upward pass:

# wake-sleep:
for epoch in range(self._num_epochs):
    # wake pass
    inputs_layers = []
    for num, rbm in enumerate(self._rbm_layers):
        if num == 0:
            inputs_layers.append(inputs)
        else:
            inputs_layers.append(
                inputs_layers[num-1].map(self._rbm_layers[num-1].forward))
    for num, rbm in enumerate(self._rbm_layers[:-1]):
        cost = 0.0
        count = 0.0
        for datapoints in inputs_layers[num].shuffle(
                self._shuffle_buffer).batch(self._batch_size):
            cost += self._rbm_layers[num].wake_update(datapoints)
            count += 1.0
        cost /= count
        print("epoch: {}, wake_cost: {}".format(epoch, cost))

Again, note that we gather a list of the transformed forward passes at each stage so that we have the necessary inputs for the update formula. We've now added a function, wake_update, to the RBM class, which will compute updates only for the generative (downward) weights, in every layer except the last (the associative, undirected connections):

def wake_update(self, x):
    with tf.GradientTape(watch_accessed_variables=False) as g:
        h_sample = self.sample_h(x)
        for step in range(self.cd_steps):
            v_sample = self.sample_v(h_sample)
            h_sample = self.sample_h(v_sample)
        g.watch(self.w_gen)  # only the generative weights...
        g.watch(self.vb)     # ...and visible biases are updated
        cost = tf.reduce_mean(self.free_energy(x)) - \
            tf.reduce_mean(self.free_energy_reverse(h_sample))
    w_grad, vb_grad = g.gradient(cost, [self.w_gen, self.vb])
    self.w_gen.assign_sub(self.learning_rate * w_grad)
    self.vb.assign_sub(self.learning_rate * vb_grad)
    return self.reconstruction_cost(x).numpy()

This is almost identical to the CD update, except that we are only updating the generative weights and the visible unit bias terms. Once we compute the forward pass, we then perform a contrastive update on the associative memory in the top layer:

# top-level associative:
self._rbm_layers[-1] = self.train_rbm(
    self._rbm_layers[-1],
    inputs_layers[-2].map(self._rbm_layers[-2].forward),
    num_epochs=self._num_epochs, tolerance=self._tolerance,
    batch_size=self._batch_size, shuffle_buffer=self._shuffle_buffer)

We then need to compute the data for the reverse pass of the wake-sleep algorithm; we do this by again applying a mapping to the last layer input:

reverse_inputs = inputs_layers[-1].map(self._rbm_layers[-1].forward)

For the sleep pass, we need to traverse the RBM in reverse, updating only the non-associative (undirected) connections. We first need to map the required input for each layer in reverse:

reverse_inputs_layers = []
for num, rbm in enumerate(self._rbm_layers[::-1]):
    if num == 0:
        reverse_inputs_layers.append(reverse_inputs)
    else:
        reverse_inputs_layers.append(
            reverse_inputs_layers[num-1].map(
                self._rbm_layers[len(self._rbm_layers)-num].reverse))

Then we perform a backward traversal of the layers, only updating the non-associative connections:

for num, rbm in enumerate(self._rbm_layers[::-1]):
    if num > 0:
        cost = 0.0
        count = 0.0
        for datapoints in reverse_inputs_layers[num].shuffle(
                self._shuffle_buffer).batch(self._batch_size):
            cost += self._rbm_layers[len(self._rbm_layers)-1-num].sleep_update(datapoints)
            count += 1.0
        cost /= count
        print("epoch: {}, sleep_cost: {}".format(epoch, cost))

Once we are satisfied with the training progress, we can tune the model further using normal backpropagation. The last step in the wake-sleep procedure is to set all the dense layers with the results of the trained weights from the RBM layers:

for dense_layer, rbm_layer in zip(dbn._dense_layers, dbn._rbm_layers):
    dense_layer.set_weights([rbm_layer.w_rec.numpy(), rbm_layer.hb.numpy()])

We have included a forward pass for a neural network in the DBN class using the call() function:

def call(self, x, training):

    for dense_layer in self._dense_layers:

        x = dense_layer(x)

    return x

This can be used in the fit() call of the TensorFlow API:

dbn.compile(loss=tf.keras.losses.CategoricalCrossentropy())
dbn.fit(x=mnist_train.map(lambda x: flatten_image(x, label=True)).batch(32))

This begins to train the now pre-trained weights using backpropagation, to fine-tune the discriminative power of the model. One way to conceptually understand this fine-tuning is that the pre-training procedure guides the weights to a reasonable configuration that captures the "shape" of the data, which backpropagation can then tune for a particular classification task. Otherwise, starting from a completely random weight configuration, the parameters are too far from capturing the variation in the data to be efficiently navigated to an optimal configuration through backpropagation alone.

You have seen how to combine multiple RBMs in layers to create a DBN, and how to run a generative learning process on the end-to-end model using the TensorFlow 2 API; in particular, we made use of the gradient tape to allow us to record and replay the gradients using a non-standard optimization algorithm (for example, not one of the default optimizers in the TensorFlow API), allowing us to plug a custom gradient update into the TensorFlow framework.