
Saturday, March 5, 2022

Contrastive divergence: Approximating a gradient

If we refer back to Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, creating a generative model of images using an RBM essentially involves finding the probability distribution of images, using the energy equation:

$$p(x; \theta) = \frac{e^{-E(x; \theta)}}{Z(\theta)}$$

where x is an image, theta is the set of parameters of the model (the weights and biases), and Z is the partition function, the integral over all possible configurations that normalizes the distribution:

$$Z(\theta) = \int e^{-E(x; \theta)}\,dx$$

In order to find the parameters that optimize this distribution, we need to maximize the likelihood (the product of each datapoint's probability under the density function) based on the data:

$$\mathcal{L}(\theta) = \prod_{i=1}^{N} p(x_i; \theta)$$

In practice, it's a bit easier to use the negative log-likelihood, as this is represented by a sum:

$$E(\theta) = -\log \mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p(x_i; \theta)$$
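To see concretely why the sum of logs is preferable to the raw product, here is a minimal sketch (my own illustration, not the book's code; the data and parameters are made up) in NumPy:

```python
import numpy as np

# Why the negative log-likelihood beats the raw likelihood product:
# the product of many densities underflows, while the log-sum does not.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

densities = gaussian_pdf(data, mu=2.0, sigma=1.5)
print(np.prod(densities))          # underflows to 0.0 in float64
print(-np.sum(np.log(densities)))  # the NLL stays a well-behaved number
```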

If the distribution f has a simple form, then we can just take the derivative of E with respect to the parameters of f. For example, if f is a single normal distribution, then the values that minimize E with respect to mu (the mean) and sigma (the standard deviation) are, respectively, the sample mean and the sample standard deviation; the partition function Z doesn't affect this calculation, because the integral of a normal density is 1, a constant whose logarithm is 0 and thus vanishes from the derivative.
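As a quick numerical check of this claim (an illustration of mine, not the book's code), we can grid-search the normal NLL and confirm the optimum sits at the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=-1.0, scale=0.7, size=2000)

def nll(mu, sigma, x):
    # Negative log-likelihood of a normal distribution, summed over the data
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

mus = np.linspace(-2.0, 0.0, 101)
sigmas = np.linspace(0.3, 1.2, 91)
grid = np.array([[nll(m, s, data) for s in sigmas] for m in mus])
i, j = np.unravel_index(grid.argmin(), grid.shape)

print("grid optimum:", mus[i], sigmas[j])
print("closed form: ", data.mean(), data.std())  # agree up to grid resolution
```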

If the distribution f is instead a sum of N normal distributions, then the partial derivative of the likelihood with respect to mu(i), the mean of one of these component distributions, involves the mu and sigma of every other component as well, since all of them appear in the sum f. Because of this dependence, there is no closed-form solution (that is, a solution we could write out by rearranging terms or applying algebraic transformations) for the optimal values; instead, we need to use a gradient search method (such as the backpropagation algorithm we discussed in Chapter 3, Building Blocks of Deep Neural Networks) to iteratively find the optimum of this function, as sketched in the example below. Again, the integral of each of these N distributions is 1, so the partition function is the constant N, whose logarithm log(N) has a derivative of 0.
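A sketch of what that gradient search looks like for a sum of normals (again my own toy example, with fixed equal weights and unit variances, fitting only the two means):

```python
import numpy as np

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

def pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

mu = np.array([0.5, 1.0])  # initial guesses for the two component means
lr = 0.05
for _ in range(500):
    p0, p1 = pdf(data, mu[0]), pdf(data, mu[1])
    f = 0.5 * p0 + 0.5 * p1  # the mixture density
    # The gradient for each mean depends on *both* components through f,
    # which is exactly why no closed-form solution exists here.
    grad0 = -np.sum(0.5 * p0 * (data - mu[0]) / f) / len(data)
    grad1 = -np.sum(0.5 * p1 * (data - mu[1]) / f) / len(data)
    mu -= lr * np.array([grad0, grad1])

print(mu)  # should approach (-2, 3), up to label order
```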

What happens if the distribution f is a product, instead of a sum, of normal distributions? The partition function Z is now no longer a constant with respect to theta, the parameters; its value depends on how and where these functions overlap when computing the integral: they could cancel each other out by being mutually exclusive (yielding 0) or reinforce each other by overlapping (yielding a value greater than 1). In order to evaluate gradient descent steps, we would need to compute this partition function using numerical methods. In the RBM example, the partition function for the configuration of 28x28 MNIST digits would involve 784 logistic units and a massive number (2^784) of possible configurations, making it unwieldy to evaluate every time we want to take a gradient step.
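To get a feel for why Z is intractable here, consider brute-forcing it for a tiny RBM (a toy illustration of mine; the weights are random and the sizes are made up). The term count doubles with every unit added, so 784 units is hopeless:

```python
import itertools
import numpy as np

n_visible, n_hidden = 8, 4  # 2^(8+4) = 4,096 configurations already
rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)  # visible biases
c = np.zeros(n_hidden)   # hidden biases

def energy(v, h):
    # Standard RBM energy: -v'Wh - b'v - c'h
    return -(v @ W @ h + b @ v + c @ h)

Z = 0.0
for v in itertools.product([0, 1], repeat=n_visible):
    for h in itertools.product([0, 1], repeat=n_hidden):
        Z += np.exp(-energy(np.array(v), np.array(h)))
print(Z)
```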

Is there any other way we could optimize the value of this energy equation without taking a full gradient? Returning to the energy equation, let's write out the gradient of the log probability explicitly:

$$\frac{\partial \log p(x; \theta)}{\partial \theta} = -\frac{\partial E(x; \theta)}{\partial \theta} - \frac{\partial \log Z(\theta)}{\partial \theta}$$

The gradient of the log partition function Z can be further written as an integral involving x and the parameters of f, which turns out to be an expectation under the model distribution:

$$\frac{\partial \log Z(\theta)}{\partial \theta} = \frac{1}{Z(\theta)} \int \frac{\partial}{\partial \theta} e^{-E(x; \theta)}\,dx = -\int p(x; \theta)\,\frac{\partial E(x; \theta)}{\partial \theta}\,dx = -\left\langle \frac{\partial E(x; \theta)}{\partial \theta} \right\rangle_{p(x)}$$

where ⟨·⟩ represents an average over data sampled from the distribution p(x). In other words, we can approximate the integral by sampling from the distribution and computing the average, which allows us to avoid computing or approximating high-dimensional integrals.
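A one-line illustration of this idea (mine, not the book's): the expectation of x^2 under a standard normal is exactly 1, and a sample average recovers it without any integration:

```python
import numpy as np

rng = np.random.default_rng(4)
samples = rng.normal(size=100_000)
print(np.mean(samples ** 2))  # ~1.0: a sample average standing in for an integral
```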

While we can't directly sample from p(x), we can use a technique known as Markov Chain Monte Carlo (MCMC) sampling to generate data from the target distribution p(x). As described in our discussion of Hopfield networks, the "Markov" property means that this sampling uses only the last sample as input when determining the probability of the next datapoint in the simulation; this forms a "chain" in which each successive sampled datapoint becomes the input to the next.
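Here is a minimal sketch of such a chain (my own example using a Metropolis accept/reject rule, one simple instance of MCMC; the double-well energy is made up). Each proposal depends only on the current sample, and only energy differences appear, so the intractable Z cancels out:

```python
import numpy as np

rng = np.random.default_rng(5)

def E(x):
    return (x ** 2 - 1) ** 2  # a double-well energy with wells at x = +/-1

x = 0.0
chain = []
for _ in range(50_000):
    proposal = x + rng.normal(scale=0.5)  # depends only on the current x
    # Metropolis acceptance: accept with probability min(1, exp(E(x) - E(prop)))
    if rng.random() < np.exp(E(x) - E(proposal)):
        x = proposal
    chain.append(x)

print(np.mean(np.abs(chain)))  # close to 1: samples cluster around the wells
```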

The "Monte Carlo" in the name of this technique is a reference to a casino in the principality of Monaco, and denotes that, like the outcomes of gambling, these samples are generated through a random process. By generating these random samples, you can use N MCMC steps as an approximation of the average of a distribution that is otherwise difficult or impossible to integrate. When we put all of this together, we get the following gradient equation:

where x^(n) represents the data at step n of the MCMC chain, with x^(0) being the input data. While in theory you might think it would take a large number of steps for the chain to converge, in practice it has been observed that even N=1 steps is enough to get a decent gradient approximation.

Notice that the end result is a contrast between the input data and the sampled data; thus, the method is named contrastive divergence as it involves the difference between two distributions.

Applying this to our RBM example, we can follow this recipe to generate the required samples:

1. Take an input vector v

2. Compute the "hidden" activation h

3. Use the activation from step 2 to generate a sampled visible state v'

4. Use v' to generate a sampled hidden state h'

5. Compute the updates, which are simply the correlations of the visible and hidden units:

$$\Delta W = \epsilon\,(v h^{T} - v' h'^{T})$$

$$\Delta b = \epsilon\,(v - v')$$

$$\Delta c = \epsilon\,(h - h')$$

where b and c are the bias terms of the visible and hidden units, respectively, and epsilon is the learning rate.

This sampling procedure is known as Gibbs sampling, a method in which we sample one unknown variable of a distribution at a time while holding all the others constant; here, we hold either the visible or the hidden units fixed and sample the other set at each step. With CD, we now have a way to perform gradient descent to learn the parameters of our RBM model (sketched in the example below); as it turns out, we can potentially build an even better model by stacking RBMs in what is called a deep belief network (DBN).
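Putting steps 1 through 5 together, here is a compact CD-1 update in NumPy (a sketch of mine for illustration; the variable names and sizes are assumptions, and the book's actual implementation uses TensorFlow 2):

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01):
    # Step 2: hidden probabilities and a sampled hidden state, given the data
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Step 3: sampled visible reconstruction, given h0
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # Step 4: hidden probabilities, given the reconstruction
    ph1 = sigmoid(v1 @ W + c)
    # Step 5: updates are differences of visible-hidden correlations
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy usage on a batch of random binary "images"
n_visible, n_hidden = 784, 128
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
batch = (rng.random((32, n_visible)) < 0.5).astype(float)
W, b, c = cd1_step(batch, W, b, c)
```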





