We previously covered several examples of how images can be compressed into numerical vectors using neural networks. This section will introduce the elements that allow us to create effective encodings to sample new images from a space of random numerical vectors, which are principally efficient inference algorithms and appropriate objective functions, Let's start by quantifying more rigorously what make such an encoding "good" and allows us to recreate images well. We will need to maximize the posterior:
p(z|w) = p(x|z)p(z)/p(x)
A problem occurs when the probability of x is extremely high dimensional, which, as you saw, can occur in even simple data such as binary MNIST digits, where we have 2^(number of pixels) possible configurations that we would need to integrate over (in a mathematical sense of integrating over a probability distribution) to get a measure of the probability of an individual image; in other words, the density p(x) is intractable, making the posterior p(z|x), which depends on p(x), likewise intractable.
In some cases, such as you saw in Chapter 4, Teaching Networks to generate Digits, we can use simple cases such as binary units to compute an approximation such as contrastive divergence, which allows us to still compute a gradient even if we can't calculate a closed from. However, this might also be challenging for very large datasets, where we would need to make many passes over the data to compute an average gradient using Contrastive Divergence(CD), as you saw previously in Chapter 4, Teaching Networks to Generate Digits.
If we can't calculate the distribution of our encoder p(z|x) directly, maybe we could optimaize an approximation that is "close enough"-let's called this q(z|x). Then, we could use a measure to determine if the distributions are close enough. One useful measure of closeness is whether the two ditributions encode similar information; we can quantify information using the Shannon Information equation:
l(p(x)) = -log(p(x))
Consider why this is a good measure: as p(x) decreases, an event becomes rarer, and thus observation of the event communicates more information about the system or dataset, leading to a positive value of -log(p(x)). Conversely, as the probability of an event nears 1, that event encodes less information about the dataset, and the value of -log(p(x)) becomes 0
Thus, if we wanted to measure the difference between the information encoded in two distributions, p and q, we could use the difference in their information:
l(p(x)) - l(q(x)) = -log(p(x)) + log(q(x)) = log(q(x)/p(x))
Finally, if we want to find the expected difference in information between the distributions for all elements of x, we can take the average over p(x):
This quantity is known as the Kulback Leibler(KL) Divergence. It has a few interesting properties:
1. It is not symmetric: KL(p(x), q(x)) does not, in general, equal KL(q(x)), p(x)), so the "closeness" is measured byu mapping one distribution to another in a particular direction.
2. Whenever q(x) and p(x) match, the term is 0, meaning they are a minimum distance from one another, Likewise, KL(p(x)), q(x)) is 0 only if p and q are identical.
3. If q(x) is 0 or p(x) is 0, then KL is undefined; by definition, it only computes relative information over the range of x where the two distributions match.
4. KL is always greater than 0.
If we were to use the KL divergence to compute how well an approximation q(z,x)is of our intractable p(z|x), we could write:
Now we can write an expression for our intractable p(x) as well: since log(p(x)) does not depend on q(z|x), teh expectation with respect to p(x) is simply log(p(x)). Thus, we con represent the objective of the VAE, learning the marginal distributiton of p(x), using the KL divergence:
The second term is also known as the Variational Lower Bound, which is also referred to as the Evidence Lower Bound(ELBO); since KL(q,p) is strictly greater than 0, log(p(x)) is strictly greater than or (if KL(q,p) is 0) equal to this value.
To explain what this objective is doing, notice that the expectation intruduces a difference between q(z|x)(encoding x) and p(x|z)p(z) (thie joint probability of the data and the encoding); thus we want to minimize a lower bound that is essentially the gap betwwen the probablity of the encoding and the joint probability of the encoding and data, with an error term given by KL(q,p), the difference betwwen a tractable approximation and intractable form of the encoder p(z|x), the difference between a tractable approximation and intractable form of the encoder p(z|x), We can imagine the functions Q(z|x) and P(x|z) being represented by two deep neural networks; one generates the latent code z(Q), and the other reconstructs x from this code(P). We can imagine this as an autoencoder setup, as above with the stacked RBM models, with an encoder and decoder:
We want to optimize the parameters of the encoder Q and the decoder P to minimize the reconstruction cost. One way to do this is to construct Monte Carlo samples to optimize the parameters of Q using gradient descent:
However, it has been found in practice that a large number of samples may be required in order for the variance of these gradient updates to stabilize.
We also have a practical problem here: even if we could choose enough sample to get a good approximation of the gradients for the encoder, out network ontains a stochasitc, non-differentiable step (sampling z) that we can't backpropagate through, in a similar way we couldn't backpropagate through the stochastic units in the RBM in Chapter 4, Teaching Networks to Generate Digits. Thus, our reconstruction error depends on samples from z, but we can't backpropagate through the step that generates these samples to tune the network end to end. Is there a way we can create a differentialbe decoder/encoder architecture while also reducting the variance of sample estimates? One of the main insights of the VAE is to enable this through the "reparameterization trick."