In our earlier discussion, we noted that we want to use q(z|x) as a way to approximate the "true" p(z|x) that would allow us to generate an ideal encoding of the data, and thus sample from it to generate new images. So far, we've assumed that q(z|x) has a relatively simple form, such as a vector of independent Gaussian random variables (a diagonal covariance matrix with 0s on the off-diagonal elements). This sort of distribution has many benefits: because it is simple, we have an easy way to generate new samples by drawing from random normal distributions, and because its elements are independent, we can separately tune each element of the latent vector z to influence different parts of the output image.
However, such a simple distribution may not fit the desired distribution of the data well, increasing the KL divergence between p(z|x) and q(z|x). Is there a way we can keep the desirable properties of q(z|x) but "transform" z so that it captures more of the complexity needed to represent x?
One approach is to apply a series of autoregressive transformations to z to turn it from a simple into a complex distribution; by "autoregressive," we mean that each transformation utilizes both data from the previous transformation and the current data to compute an updated version of z. In contrast, the basic form of VAE that we introduced above has only a single "transformation": from z to the output (though z might pass through multiple layers, there is no recursive network link to further refine that output). We've seen such transformations before, such as the LSTM networks in Chapter 3, Building Blocks of Deep Neural Networks, where the output of the network is a combination of the current input and a weighted version of the prior time step's output.
An attractive property of the independent q(z|x) distributions we discussed earlier, such as independent normals, is that they have a very tractable expression for the log-likelihood. This property is important for the VAE model because its objective function depends on integrating over the whole likelihood function, which would be cumbersome for more complex log-likelihood functions. However, by constraining the transformation of z to a series of autoregressive steps, we get the nice property that the output of step t depends only on step t-1; the Jacobian (the matrix of partial derivatives of step t with respect to step t-1) is therefore lower triangular, and its log-determinant can be computed as a simple sum:
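In the standard inverse autoregressive flow (IAF) formulation, with σ_t denoting the scale terms produced at step t and D the dimensionality of z, this sum takes the form

$$\log \left| \det \frac{\partial z_t}{\partial z_{t-1}} \right| = \sum_{i=1}^{D} \log \sigma_{t,i}$$

so that, after T transformations,

$$\log q(z_T \mid x) = \log q(z_0 \mid x) - \sum_{t=1}^{T} \sum_{i=1}^{D} \log \sigma_{t,i}$$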
What kinds of transformations f could be used? Recall that, after the reparameterization trick, z is a function of a noise element ε and the mean and standard deviation output by the encoder Q:
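In symbols (with ⊙ denoting element-wise multiplication and the subscript 0 marking the initial sample, before any transformation):

$$z_0 = \mu_0 + \sigma_0 \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$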
If we apply successive layers of transformation, step t becomes the sum of a shift term u and the element-wise product of the prior layer's z and a sigmoidal scale:
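That is, in the usual IAF notation:

$$z_t = u_t + \sigma_t \odot z_{t-1}, \qquad \sigma_t = \operatorname{sigmoid}(s_t)$$

where u_t and s_t are the outputs of the autoregressive network at step t.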
In practice, we use a neural network transformation to stabilize the estimate of the mean at each step:
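In the numerically stable form used for IAF, this becomes a gated update (with m_t and s_t produced by the step-t autoregressive network):

$$z_t = \sigma_t \odot z_{t-1} + (1 - \sigma_t) \odot m_t, \qquad \sigma_t = \operatorname{sigmoid}(s_t)$$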
Again, note the similarity of this transformation to the LSTM networks discussed in Chapter 3, Building Blocks of Deep Neural Networks. In Figure 5.8, there is another output (h) from the encoder Q, in addition to the mean and standard deviation used to sample z. h is, in essence, "accessory data" that is passed into each successive transformation and, along with the weighted sum being calculated at each step, represents the "persistent memory" of the network in a way reminiscent of the LSTM.
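To make the mechanics concrete, here is a minimal TensorFlow sketch of one such step. This is a hypothetical illustration rather than the book's implementation: it uses a simple linear autoregressive transform, with strictly lower-triangular weights standing in for the masked networks used in practice, and the class and parameter names are assumptions. It takes the previous z together with the accessory context h and returns both the updated z and the log-determinant term that enters the objective.

```python
import tensorflow as tf


class LinearIAFStep(tf.keras.layers.Layer):
    """One inverse autoregressive flow step (illustrative sketch, not the book's code).

    The shift m and gate pre-activation s are linear autoregressive functions of
    the previous z (strictly lower-triangular weights) plus the context h, so the
    Jacobian dz_t/dz_{t-1} is lower triangular and its log-determinant is a sum.
    """

    def __init__(self, latent_dim):
        super().__init__()
        d = self.latent_dim = latent_dim
        # Strictly lower-triangular mask enforces the autoregressive structure:
        # element i of m and s may only depend on z_prev[..., :i].
        self.mask = tf.linalg.band_part(tf.ones((d, d)), -1, 0) - tf.eye(d)
        self.w_m = self.add_weight(name="w_m", shape=(d, d), initializer="glorot_uniform")
        self.w_s = self.add_weight(name="w_s", shape=(d, d), initializer="glorot_uniform")
        self.b_m = self.add_weight(name="b_m", shape=(d,), initializer="zeros")
        # Bias the gate toward "keep the previous z", as with an LSTM forget gate.
        self.b_s = self.add_weight(name="b_s", shape=(d,), initializer="ones")
        # The accessory encoder output h enters every step as extra context.
        self.h_proj = tf.keras.layers.Dense(2 * d)

    def call(self, z_prev, h):
        h_m, h_s = tf.split(self.h_proj(h), 2, axis=-1)
        m = z_prev @ (self.w_m * self.mask) + h_m + self.b_m
        s = z_prev @ (self.w_s * self.mask) + h_s + self.b_s
        gate = tf.sigmoid(s)
        # Gated update: interpolate between the previous z and the new mean m.
        z = gate * z_prev + (1.0 - gate) * m
        # The Jacobian's diagonal is the gate, so the log-determinant is a sum.
        log_det = tf.reduce_sum(tf.math.log_sigmoid(s), axis=-1)
        return z, log_det
```

Stacking several such steps and summing their log_det outputs gives the sum-of-log-scales correction to log q(z_0|x) shown above, which is all the VAE objective needs from the flow.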