You have seen that an RBM with a single hidden layer can be used to learn a generative model of images; in fact, theoretical work has suggested that with a sufficiently large number of hidden units, an RBM can approximate any distribution over binary values. However, in practice, for very large input data, it may be more efficient to add additional layers, instead of one single large layer, allowing a more "compact" representation of the data.
The researchers who developed DBNs also noted that adding layers can only improve (never decrease) the variational lower bound on the log-likelihood of the data reconstructed by the generative model. In this case, the hidden layer output h of the first layer becomes the input to a second RBM, and we can keep adding layers to make a deeper network. Furthermore, if we wanted to make this network capable of learning not only the distribution of the image (x) but also the label (y), that is, which digit from 0 to 9 it represents, we could add yet another layer to the stack of connected RBMs: a probability distribution (softmax) over the 10 possible digit classes.
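As a rough sketch of this stacking idea, the sampled hidden state of each layer simply becomes the "visible" input of the next RBM. The snippet below uses random weights in place of trained RBM parameters and layer sizes matching the MNIST model discussed later, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(v, W, c):
    """Sample binary hidden units given the layer below (sigmoid + Bernoulli)."""
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + c)))
    return (rng.random(p_h.shape) < p_h).astype(np.float32)

# A stack of weight matrices: 784 -> 500 -> 500 -> 2000 units.
# Random values stand in for weights that would really come from training.
sizes = [784, 500, 500, 2000]
weights = [rng.normal(0, 0.01, (sizes[i], sizes[i + 1])) for i in range(3)]
biases = [np.zeros(sizes[i + 1]) for i in range(3)]

x = rng.integers(0, 2, (32, 784)).astype(np.float32)  # a batch of binary "images"
h = x
for W, c in zip(weights, biases):
    h = sample_hidden(h, W, c)   # this layer's hidden sample is the next RBM's "data"
```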
A problem with training a very deep graphical model such as stacked RBMs is the "explaining-away effect" that we discussed in Chapter 3, Building Blocks of Deep Neural Networks. Recall that dependencies between variables can complicate inference of the state of the hidden variables:
In Figure 4.8, the knowledge that the pavement is wet can be explained by a sprinkler being turned on, to the extent that the presence or absence of rain becomes irrelevant, meaning we can't meaningfully infer the probability that it is raining. This is equivalent to saying that the posterior distribution (Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models) of the hidden units cannot be tractably computed, since they are correlated, which interferes with easily sampling the hidden state of the RBM.
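To make this concrete, here is a small numerical illustration of explaining away; the prior and conditional probabilities below are invented purely for this example and are not from the chapter:

```python
# Toy illustration of explaining away, with made-up probabilities.
# Rain and Sprinkler are independent a priori, but become dependent
# once we observe that the pavement is wet.
p_rain, p_sprinkler = 0.2, 0.3

def p_wet(rain, sprinkler):
    # Assumed probability that the pavement is wet for each combination of causes.
    if rain and sprinkler:
        return 0.99
    if rain or sprinkler:
        return 0.9
    return 0.01

def joint(rain, sprinkler):
    # Joint probability P(rain, sprinkler, wet=True).
    pr = p_rain if rain else 1 - p_rain
    ps = p_sprinkler if sprinkler else 1 - p_sprinkler
    return pr * ps * p_wet(rain, sprinkler)

p_wet_total = sum(joint(r, s) for r in (0, 1) for s in (0, 1))
p_rain_given_wet = sum(joint(1, s) for s in (0, 1)) / p_wet_total
p_rain_given_wet_and_sprinkler = joint(1, 1) / (joint(0, 1) + joint(1, 1))

print(f"P(rain | wet)               = {p_rain_given_wet:.2f}")
print(f"P(rain | wet, sprinkler=on) = {p_rain_given_wet_and_sprinkler:.2f}")
```

With these numbers, P(rain | wet) is roughly 0.46, but once we also learn that the sprinkler was on it drops to roughly 0.22: observing one cause "explains away" the evidence for the other, so the two causes are no longer independent in the posterior.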
One solution is to treat each of the units as independent in the likelihood function, an approach known as variational inference; while this works in practice, it is not a satisfying solution, given that we know these units are in fact correlated.
But where does this correlation come from? If we sample the state of the visible units in a single-layer RBM, we set the state of each hidden unit randomly, since they are independent; thus the prior distribution over the hidden units is independent. Why, then, is the posterior correlated? Just as the knowledge (data) that the pavement is wet induces a correlation between the probabilities of a sprinkler and rainy weather, the correlation between pixel values causes the posterior distribution of the hidden units to be non-independent. This is because the pixels in the images aren't set randomly; depending on which digit the image represents, groups of pixels are more or less likely to be bright or dark. In the 2006 paper A Fast Learning Algorithm for Deep Belief Nets, the authors hypothesized that this problem could be solved by computing a complementary prior that has exactly the opposite correlation to the likelihood, thus canceling out this dependence and making the posterior independent as well.
To compute this complementary prior, we can use the posterior distribution over hidden units in a higher layer. The trick to generating such distributions is a greedy, layer-wise procedure for "priming" the network of stacked RBMs in a multilayer generative model, such that the weights can then be fine-tuned as a classification model. For example, let's consider a three-layer model for the MNIST data (Figure 4.9):
The two 500-unit layers form representations of the MNIST digits, while the 2000- and 10-unit layers form an "associative memory" that correlates labels with the digit representation. The first two layers have directed connections (different weights) for upward and downward sampling, while the top layers have undirected weights (the same weight for the forward and backward passes).
This model can be learned in stages. For the first 500-unit RBM, we treat it as an undirected model by enforcing that the forward and backward weights are equal; we then use CD to learn the parameters of this RBM. We then fix these weights and learn a second (500-unit) RBM that uses the hidden units from the first layer as its input "data," and repeat for the 2000-unit layer.
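A minimal NumPy sketch of this greedy, layer-wise procedure might look as follows. It uses a bare-bones CD-1 update without minibatching, momentum, or the other refinements a real implementation would have, and random binary data stands in for MNIST:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(np.float32)

def train_rbm_cd1(data, n_hidden, epochs=5, lr=0.01):
    """Train one binary RBM with CD-1 (a simplified sketch, full-batch updates only)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    b = np.zeros(n_visible)                      # visible bias
    c = np.zeros(n_hidden)                       # hidden bias
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + c)               # positive phase
        h0 = sample(p_h0)
        p_v1 = sigmoid(h0 @ W.T + b)             # one Gibbs step down...
        p_h1 = sigmoid(p_v1 @ W + c)             # ...and back up (negative phase)
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
        b += lr * (v0 - p_v1).mean(axis=0)
        c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b, c

# Greedy, layer-wise "priming": each trained layer's hidden probabilities
# become the "data" for the next RBM (layer sizes follow Figure 4.9).
x = rng.integers(0, 2, (256, 784)).astype(np.float32)   # stand-in for MNIST images
layer_data, stack = x, []
for n_hidden in (500, 500, 2000):
    W, b, c = train_rbm_cd1(layer_data, n_hidden)
    stack.append((W, b, c))                      # these weights are now frozen
    layer_data = sigmoid(layer_data @ W + c)     # input "data" for the next layer
```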
After we have "primed" the network, we no longer need to enforce that the weights in the bottom layers are tied, and we can fine-tune the weights using an algorithm known as "wake-sleep."
Firstly, we take the input data (the digits) and compute the activations of the other layers all the way up to the connections between the 2000- and 10-unit layers. We compute updates to the "generative weights" (those that compute the activations that yield image data from the network), which point downward, using the previously given gradient equations. This is the "wake" phase because, if we consider the network as resembling a biological sensory system, it receives input from the environment through this forward pass.
For the 2000- and 10-unit layers, we use the CD sampling procedure, with the second 500-unit layer's output as "data," to update the undirected weights.
We then take the output of the 2000-unit layer and compute activations downward, updating the "recognition weights" (those that compute activations that lead to the classification of the image into one of the digit classes), which point upward. This is called the "sleep" phase because it displays what is in the "memory" of the network, rather than taking data from outside.
We then repeat these steps until convergence.
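The following is a much-simplified sketch of one wake-sleep iteration for a single directed layer; biases are dropped and a random vector stands in for the sample that would really come from the top-level associative memory, so it only illustrates the form of the generative and recognition weight updates:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(np.float32)

n_below, n_above, lr = 784, 500, 0.01
W_rec = rng.normal(0, 0.01, (n_below, n_above))   # recognition (upward) weights
W_gen = W_rec.T.copy()                            # generative (downward) weights, initially tied

x = rng.integers(0, 2, (64, n_below)).astype(np.float32)   # stand-in batch of digit images

# Wake phase: drive the layer with data through the recognition weights, then
# adjust the generative weights so they reconstruct the data from the hidden state.
h = sample(sigmoid(x @ W_rec))
v_recon = sigmoid(h @ W_gen)
W_gen += lr * h.T @ (x - v_recon) / len(x)

# Sleep phase: generate a "dream" downward through the generative weights, then
# adjust the recognition weights so they recover the hidden state that produced it.
h_dream = sample(np.full((64, n_above), 0.5))     # stand-in for a top-level RBM sample
v_dream = sample(sigmoid(h_dream @ W_gen))
h_recon = sigmoid(v_dream @ W_rec)
W_rec += lr * v_dream.T @ (h_dream - h_recon) / len(x)
```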
Note that in practice, instead of using undirected weights in the top layers of the network, we could replace the last layer with directed connections and a softmax classifier. This network would then technically no longer be a DBN, but rather a regular Deep Neural Network that we could optimize with backpropagation. This is an approach we will take in our own code, as we can then leverage TensorFlow's built-in gradient calculations, and it fits into the paradigm of the Model API.
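A minimal sketch of this conversion, assuming `stack` holds the (W, b, c) tuples produced by the greedy pretraining sketch above, would initialize ordinary Dense layers from the primed weights and then fine-tune the whole network as a standard Keras classifier:

```python
import tensorflow as tf

# Feedforward version of the primed stack, topped with a softmax classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(500, activation="sigmoid", input_shape=(784,)),
    tf.keras.layers.Dense(500, activation="sigmoid"),
    tf.keras.layers.Dense(2000, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Copy the pretrained RBM weights into the first three Dense layers
# (`stack` is the hypothetical list built in the pretraining sketch above).
for layer, (W, b, c) in zip(model.layers[:3], stack):
    layer.set_weights([W, c])   # Dense kernel <- RBM weights, Dense bias <- hidden bias

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)   # standard backpropagation fine-tuning
```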
Now that we have covered the theoretical background needed to understand how a DBN is trained and how the pre-training approach resolves the "explaining-away" effect, we will implement the whole model in code, showing how we can leverage TensorFlow 2's gradient tape functionality to implement CD as a custom learning algorithm.
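As a preview of the idea (a minimal sketch, not the chapter's full implementation), the CD-1 update for a single binary RBM can be expressed as the gradient of a pseudo-loss: the difference in free energy between the data and its one-step Gibbs reconstruction, which tf.GradientTape can then differentiate for us:

```python
import tensorflow as tf

n_visible, n_hidden = 784, 500
W = tf.Variable(tf.random.normal([n_visible, n_hidden], stddev=0.01))
b = tf.Variable(tf.zeros([n_visible]))   # visible bias
c = tf.Variable(tf.zeros([n_hidden]))    # hidden bias
opt = tf.keras.optimizers.SGD(learning_rate=0.01)

def free_energy(v):
    # Free energy of a binary RBM: -v.b - sum_j softplus(c_j + v.W_j)
    return (-tf.reduce_sum(v * b, axis=1)
            - tf.reduce_sum(tf.math.softplus(tf.matmul(v, W) + c), axis=1))

def sample_bernoulli(p):
    return tf.cast(tf.random.uniform(tf.shape(p)) < p, tf.float32)

@tf.function
def cd1_step(v_data):
    # One Gibbs step: data -> hidden sample -> visible reconstruction.
    h = sample_bernoulli(tf.sigmoid(tf.matmul(v_data, W) + c))
    v_model = sample_bernoulli(tf.sigmoid(tf.matmul(h, W, transpose_b=True) + b))
    v_model = tf.stop_gradient(v_model)   # treat the negative sample as a constant
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(free_energy(v_data)) - tf.reduce_mean(free_energy(v_model))
    grads = tape.gradient(loss, [W, b, c])
    opt.apply_gradients(zip(grads, [W, b, c]))
    return loss
```

Because the negative sample is held constant, gradients flow only through the two free-energy terms, and differentiating this pseudo-loss reproduces the CD update for the weights and biases.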