Saturday, March 12, 2022

5. Painting Pictures with Neural Networks Using VAEs

As you saw in Chapter 4, Teaching Networks to Generate Digits, deep neural networks are a powerful tool for creating generative models for complex data such as images, allowing us to develop a network that can generate images from the MNIST hand-drawn digit database. In that example, the data is relatively simple; images can only come from a limited set of categories (the digits 0 through 9) and are low-resolution grayscale data.

What about more complex data, such as color images drawn from the real world? One example of such "real world" data is the Canadian Institute for Advanced Research 10 class dataset, denoted as CIFAR-10. It is a subset of 60,000 examples from a larger set of 80 million images, divided into ten classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. While still an extremely limited set in terms of the diversity of images we would encounter in the real world, these classes have some characteristics that make them more complex than MNIST. For example, the MNIST digits can vary in width, curvature, and a few other properties; the CIFAR-10 classes have a much wider potential range of variation for animal or vehicle photos, meaning we may require more complex models in order to capture this variation.

In this chapter, we will discuss a class of generative models known as Variational Autoencoders (VAEs), which are designed to make the generation of these complex, real-world images more tractable and tunable. They do this by using a number of clever simplifications that make it possible to sample, in a scalable way, from the complex probability distribution represented by real-world images.

We will explore the following topics to reveal how VAEs work:

- How neural networks create low-dimensional representations of data, and some desirable properties of those representations

- How variational methods allow us to sample from complex data using these representations

- How using the reparameterization trick allows us to stabilize the variance of a neural network based on variational sampling (a VAE)

- How we can use Inverse Autoregressive Flow (IAF) to tune the output of a VAE

- How to implement VAE/IAF in TensorFlow


Summary

In this chapter, you learned about one of the most important models from the beginnings of the deep learning revolution, the DBN. You saw that DBNs are constructed by stacking together RBMs, and how these undirected models can be trained using CD.

This chapter then described a greedy, layer-wise procedure for priming a DBN by sequentially training each of a stack of RBMs, which can then be fine-tuned using the wake-sleep algorithm or backpropagation. We then explored practical examples of using the TensorFlow 2 API to create an RBM layer and a DBN model, illustrating the use of the GradientTape class to compute updates using CD.

You also learned how, following the wake-sleep algorithm, we can compile the DBN as a normal Deep Neural Network and perform backpropagation for supervised training. We applied these models to MNIST data and saw how an RBM can generate digits after training converges, and has features resembling the convolutional filters described in Chapter 3, Building Blocks of Deep Neural Networks.

While the examples in the chapter involved significantly extending the basic layer and model classes of the TensorFlow Keras API, they should give you an idea of how to implement your own low-level alternative training procedures. Going forward, we will mostly stick to using the standard fit() and predict() methods, starting with our next topic, Variational Autoencoders, a sophisticated and computationally efficient way to generate image data.


Creating a DBN with the Keras Model API

You have now seen how to create a single-layer RBM to generate images; this is the building block required to create a full-fledged DBN. Usually, for a model in TensorFlow 2, we only need to extend tf.keras.Model and define an initialization (where the layers are defined) and a call function (for the forward pass). For our DBN model, we also need a few more custom functions to define its behavior.

First, in the initialization, we need to pass a list of dictionaries that contain the parameters for our RBM layers (number_hidden_units, number_visible_units, learning_rate, cd_steps):

import tensorflow as tf

class DBN(tf.keras.Model):
    def __init__(self, rbm_params=None, name='deep_belief_network', num_epochs=100, tolerance=1e-3, batch_size=32, shuffle_buffer=1024, **kwargs):
        super().__init__(name=name, **kwargs)
        self._rbm_params = rbm_params
        self._rbm_layers = list()
        self._dense_layers = list()
        for num, rbm_param in enumerate(rbm_params):
            # one RBM per parameter dictionary, built for its visible size
            self._rbm_layers.append(RBM(**rbm_param))
            self._rbm_layers[-1].build([rbm_param["number_visible_units"]])
            if num < len(rbm_params)-1:
                # intermediate layers: sigmoid dense layer mirroring the RBM
                self._dense_layers.append(
                    tf.keras.layers.Dense(rbm_param["number_hidden_units"], activation=tf.nn.sigmoid))
            else:
                # final layer: softmax over the output classes
                self._dense_layers.append(
                    tf.keras.layers.Dense(rbm_param["number_hidden_units"], activation=tf.nn.softmax))
            self._dense_layers[-1].build([rbm_param["number_visible_units"]])
        # training settings are stored once, outside the layer loop
        self._num_epochs = num_epochs
        self._tolerance = tolerance
        self._batch_size = batch_size
        self._shuffle_buffer = shuffle_buffer

Note at the same time that we also initialize a set of sigmoidal dense layers with a softmax at the end, which we can use for fine-tuning through backpropagation once we've trained the model using the generative procedures outlined earlier. To train the DBN, we begin a new code block to start the generative learning process for the stack of RBMs:

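#pretraining:
# train_rbm has no default arguments, so the stored settings are passed explicitly
train_args = dict(num_epochs=self._num_epochs, tolerance=self._tolerance,
                  batch_size=self._batch_size, shuffle_buffer=self._shuffle_buffer)
inputs_layers = []
for num in range(len(self._rbm_layers)):
    if num == 0:
        inputs_layers.append(inputs)
        self._rbm_layers[num] = \
            self.train_rbm(self._rbm_layers[num], inputs, **train_args)
    else:    #pass all data through previous layer
        inputs_layers.append(inputs_layers[num-1].map(self._rbm_layers[num-1].forward))
        self._rbm_layers[num] = \
            self.train_rbm(self._rbm_layers[num], inputs_layers[num], **train_args)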

Notice that for computational efficiency, we generate the input for each layer past the first by passing every datapoint through the prior layer in a forward pass using the map() function of the Dataset API, instead of having to generate these forward samples repeatedly. While this takes more memory, it greatly reduces the computation required. Each layer in the pre-training loop calls back to the CD loop you saw before, which is now a member function of the DBN class:

def train_rbm(self, rbm, inputs, num_epochs, tolerance, batch_size, shuffle_buffer):
    last_cost = None
    for epoch in range(num_epochs):
        cost = 0.0
        count = 0.0
        # note: take(1) limits each epoch to a single shuffled batch
        for datapoints in inputs.shuffle(shuffle_buffer).batch(batch_size).take(1):
            cost += rbm.cd_update(datapoints)
            count += 1.0
        cost /= count
        print("epoch: {}, cost: {}".format(epoch, cost))
        # stop early once the cost changes by no more than the tolerance
        if last_cost and abs(last_cost-cost) <= tolerance:
            break
        last_cost = cost
    return rbm

Once we have pre-trained in a greedy manner, we can proceed to the wake-sleep step. We start with the upward pass:

# wake-sleep:
for epoch in range(self._num_epochs):
    # wake pass
    inputs_layers = []
    for num, rbm in enumerate(self._rbm_layers):
        if num == 0:
            inputs_layers.append(inputs)
        else:
            inputs_layers.append(inputs_layers[num-1].map(self._rbm_layers[num-1].forward))
    # update the generative weights of every layer except the last
    for num, rbm in enumerate(self._rbm_layers[:-1]):
        cost = 0.0
        count = 0.0
        for datapoints in inputs_layers[num].shuffle(self._shuffle_buffer).batch(self._batch_size):
            cost += self._rbm_layers[num].wake_update(datapoints)
            count += 1.0
        cost /= count
        print("epoch: {}, wake_cost: {}".format(epoch, cost))

Again, note that we gather a list of the transformed forward passes at each stage so that we have the necessary inputs for the update formula. We've now added a function, wake_update, to the RBM class, which will compute updates only for the generative (downward) weights, in every layer except the last (the associative, undirected connections):

def wake_update(self, x):
    with tf.GradientTape(watch_accessed_variables=False) as g:
        h_sample = self.sample_h(x)
        for step in range(self.cd_steps):
            v_sample = self.sample_v(h_sample)
            h_sample = self.sample_h(v_sample)
        g.watch(self.w_gen)
        g.watch(self.vb)
        cost = tf.reduce_mean(self.free_energy(x)) - tf.reduce_mean(self.free_energy_reverse(h_sample))
    w_grad, vb_grad = g.gradient(cost, [self.w_gen, self.vb])
    self.w_gen.assign_sub(self.learning_rate * w_grad)
    self.vb.assign_sub(self.learning_rate * vb_grad)
    return self.reconstruction_cost(x).numpy()

This is almost identical to the CD update, except that we are only updating the generative weights and the visible unit bias terms. Once we compute the forward pass, we then perform a contrastive update on the associative memory in the top layer:

#top-level associative:
self._rbm_layers[-1] = self.train_rbm(
    self._rbm_layers[-1],
    inputs_layers[-2].map(self._rbm_layers[-2].forward),
    num_epochs=self._num_epochs,
    tolerance=self._tolerance,
    batch_size=self._batch_size,
    shuffle_buffer=self._shuffle_buffer)

We then need to compute the data for the reverse pass of the wake-sleep algorithm; we do this by again applying a mapping to the last layer input:

reverse_inputs = inputs_layers[-1].map(self._rbm_layers[-1].forward)

For the sleep pass, we need to traverse the RBM in reverse, updating only the non-associative (directed) connections. We first need to map the required input for each layer in reverse:

reverse_inputs_layers = []
for num, rbm in enumerate(self._rbm_layers[::-1]):
    if num == 0:
        reverse_inputs_layers.append(reverse_inputs)
    else:
        reverse_inputs_layers.append(
            reverse_inputs_layers[num-1].map(self._rbm_layers[len(self._rbm_layers)-num].reverse))

Then we perform a backward traversal of the layers, only updating the non-associative connections:

for num, rbm in enumerate(self._rbm_layers[::-1]):
    if num > 0:
        cost = 0.0
        count = 0.0
        for datapoints in reverse_inputs_layers[num].shuffle(self._shuffle_buffer).batch(self._batch_size):
            cost += self._rbm_layers[len(self._rbm_layers)-1-num].sleep_update(datapoints)
            count += 1.0
        cost /= count
        print("epoch: {}, sleep_cost: {}".format(epoch, cost))

Once we are satisfied with the training progress, we can tune the model further using normal backpropagation. The last step in the wake-sleep procedure is to set all the dense layers with the results of the trained weights from the RBM layers:

for dense_layer, rbm_layer in zip(dbn._dense_layers, dbn._rbm_layers):
    dense_layer.set_weights([rbm_layer.w_rec.numpy(), rbm_layer.hb.numpy()])

We have included a forward pass for a neural network in the DBN class using the call() function:

def call(self, x, training):
    for dense_layer in self._dense_layers:
        x = dense_layer(x)
    return x

This can be used in the fit() call of the TensorFlow API:

dbn.compile(loss=tf.keras.losses.CategoricalCrossentropy())
dbn.fit(x=mnist_train.map(lambda x: flatten_image(x, label=True)).batch(32))

This begins to train the now pre-trained weights using backpropagation, to fine-tune the discriminative power of the model. One way to conceptually understand this fine-tuning is that the pre-training procedure guides the weights to a reasonable configuration that captures the "shape" of the data, which backpropagation can then tune for a particular classification task. Otherwise, starting from a completely random weight configuration, the parameters are too far from capturing the variation in the data to be efficiently navigated to an optimal configuration through backpropagation alone.

You have seen how to combine multiple RBMs in layers to create a DBN, and how to run a generative learning process on the end-to-end model using the TensorFlow 2 API. In particular, we made use of the gradient tape to record and replay the gradients using a non-standard optimization algorithm (that is, one that is not among the default optimizers in the TensorFlow API), allowing us to plug a custom gradient update into the TensorFlow framework.


    


        

Friday, March 11, 2022

Creating an RBM using the TensorFlow Keras layers API

Now that you have an appreciation of some of the theoretical underpinnings of the RBM, let's look at how we can implement it using the TensorFlow 2.0 library. For this purpose, we will represent the RBM as a custom layer type using the Keras layers API.

Code in this chapter was adapted to TensorFlow 2 from the original Theano (another deep learning Python framework) code from deeplearning.net.

Firstly, we extend tf.keras.layers.Layer:

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_probability as tfp

class RBM(layers.Layer):
    def __init__(self, number_hidden_units=10, number_visible_units=None, learning_rate=0.1, cd_steps=1):
        super().__init__()
        self.number_hidden_units = number_hidden_units
        self.number_visible_units = number_visible_units
        self.learning_rate = learning_rate
        self.cd_steps = cd_steps

We input a number of hidden units, visible units, a learning rate for CD updates, and the number of steps to take with each CD pass. For the layers API, we are only required to implement two functions: build() and call(). build() is executed the first time the layer is called on an input (or when we invoke it directly, as we do later in this section), and is used to initialize the weights of the network, including inferring the right size of the weights given the input dimensions:

def build(self, input_shape):
    # infer the visible size from the input only when it was not given
    if not self.number_visible_units:
        self.number_visible_units = input_shape[-1]
    # recognition (upward) and generative (downward) weights
    self.w_rec = self.add_weight(shape=(self.number_visible_units, self.number_hidden_units), initializer='random_normal', trainable=True)
    self.w_gen = self.add_weight(shape=(self.number_hidden_units, self.number_visible_units), initializer='random_normal', trainable=True)
    # hidden and visible bias vectors
    self.hb = self.add_weight(shape=(self.number_hidden_units,), initializer='random_normal', trainable=True)
    self.vb = self.add_weight(shape=(self.number_visible_units,), initializer='random_normal', trainable=True)

We also need a way to perform both forward and reverse samples from the model.

For the forward pass, we need to compute sigmoidal activations from the input, and then stochastically turn the hidden units on or off based on the activation probability between 0 and 1 given by that sigmoidal activation:

def forward(self, x):
    return tf.sigmoid(tf.add(tf.matmul(x, self.w_rec), self.hb))

def sample_h(self, x):
    # one uniform draw per (example, hidden unit); shape matches forward(x)
    u_sample = tfp.distributions.Uniform().sample((x.shape[0], self.hb.shape[-1]))
    return tf.cast(self.forward(x) > u_sample, tf.float32)

Likewise, we need a way to sample in reverse for the visible units:

def reverse(self, x):
    return tf.sigmoid(tf.add(tf.matmul(x, self.w_gen), self.vb))

def sample_v(self, x):
    # one uniform draw per (example, visible unit); shape matches reverse(x)
    u_sample = tfp.distributions.Uniform().sample((x.shape[0], self.vb.shape[-1]))
    return tf.cast(self.reverse(x) > u_sample, tf.float32)

We also implement call() in the RBM class, which provides the forward pass we would use if we were to use the fit() method of the Model API for backpropagation (which we can do for fine-tuning later in our deep belief model):

def call(self, inputs):
    return tf.sigmoid(tf.add(tf.matmul(inputs, self.w_rec), self.hb))

To actually implement CD learning for each RBM, we need to create some additional functions. The first calculates the free energy, as you saw in the Gibbs measure earlier in this chapter:

def free_energy(self, x):
    return -tf.tensordot(x, self.vb, 1)\
        -tf.reduce_sum(tf.math.log(1+tf.math.exp(tf.add(tf.matmul(x, self.w_rec), self.hb))), 1)

Note here that we could have used the Bernoulli distribution from tensorflow_probability in order to perform this sampling, using the sigmoidal activations as the probabilities; however, this is slow and would cause performance issues when we need to repetitively sample during CD learning. Instead, we use a speedup: we sample an array of uniform random numbers the same size as the sigmoidal array and then set the hidden unit to 1 if its activation is greater than the random number. Thus, if a sigmoidal activation is 0.9, it has a 90% probability of being greater than a randomly sampled uniform number, and is set to "on." This has the same behavior as sampling a Bernoulli variable with a probability of 0.9, but is computationally much more efficient. The reverse and visible samples are computed similarly. Finally, putting these together allows us to perform both forward and reverse Gibbs samples:

def forward_gibbs(self, x):
    return self.sample_v(self.sample_h(x))

def reverse_gibbs(self, x):
    return self.sample_h(self.sample_v(x))
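The claim that thresholding against a uniform draw behaves like a Bernoulli sample can be checked empirically. The following is a minimal sketch in plain Python rather than TensorFlow; sample_binary and the fixed seed are illustrative names and choices, not part of the RBM class:

```python
import random

def sample_binary(probabilities, rng):
    """Turn each unit on (1.0) when its probability exceeds a fresh uniform draw."""
    return [1.0 if p > rng.random() else 0.0 for p in probabilities]

rng = random.Random(0)  # fixed seed so the run is repeatable
num_draws = 20000

# A single unit with sigmoidal activation 0.9, sampled many times.
on_count = sum(sample_binary([0.9], rng)[0] for _ in range(num_draws))
on_fraction = on_count / num_draws

# The fraction of "on" samples should be close to the activation itself.
print(on_fraction)
```

The printed fraction should land near 0.9, which is what a Bernoulli variable with probability 0.9 would produce.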

To perform the CD updates, we make use of TensorFlow 2's eager execution and the GradientTape API you saw in Chapter 3, Building Blocks of Deep Neural Networks:

def cd_update(self, x):
    with tf.GradientTape(watch_accessed_variables=False) as g:
        h_sample = self.sample_h(x)
        for step in range(self.cd_steps):
            # treat the reconstruction as a constant during differentiation
            v_sample = tf.constant(self.sample_v(h_sample))
            h_sample = self.sample_h(v_sample)
        g.watch(self.w_rec)
        g.watch(self.hb)
        g.watch(self.vb)
        cost = tf.reduce_mean(self.free_energy(x)) - tf.reduce_mean(self.free_energy(v_sample))
    w_grad, hb_grad, vb_grad = g.gradient(cost, [self.w_rec, self.hb, self.vb])
    self.w_rec.assign_sub(self.learning_rate * w_grad)
    self.w_gen.assign(tf.transpose(self.w_rec))  # force tying
    self.hb.assign_sub(self.learning_rate * hb_grad)
    self.vb.assign_sub(self.learning_rate * vb_grad)
    return self.reconstruction_cost(x).numpy()

We perform one or more sample steps, and compute the cost using the difference between the free energy of the data and the reconstructed data (which is cast as a constant using tf.constant so that we don't treat it as a variable during autogradient calculation). We then compute the gradients with respect to the recognition weight matrix and the two bias vectors and update their values, before returning our reconstruction cost as a way to monitor progress.

The reconstruction cost is simply the cross-entropy loss between the input and reconstructed data:

def reconstruction_cost(self, x):
    # reconstruct the input by passing it up and back down the layer
    reconstruction = self.reverse(self.forward(x))
    return tf.reduce_mean(
        tf.reduce_sum(
            tf.math.add(
                tf.math.multiply(x, tf.math.log(reconstruction)),
                tf.math.multiply(tf.math.subtract(1.0, x), tf.math.log(tf.math.subtract(1.0, reconstruction)))),
            1))

which represents the formula:

where y is the target label, y-hat is the estimated label from the softmax function, and N is the number of elements in the dataset.
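As a framework-free illustration of this quantity, the sketch below computes a binary cross-entropy between target pixels and reconstructed probabilities in plain Python. The function name, the eps clamp, and the conventional leading minus sign (so that lower is better) are choices of this sketch; the reconstruction_cost method above sums the same log terms without negating them.

```python
import math

def reconstruction_cross_entropy(targets, reconstructions, eps=1e-7):
    """Mean over examples of the per-pixel binary cross-entropy, summed per example."""
    total = 0.0
    for x_row, r_row in zip(targets, reconstructions):
        for x, r in zip(x_row, r_row):
            r = min(max(r, eps), 1.0 - eps)  # keep log() finite
            total -= x * math.log(r) + (1.0 - x) * math.log(1.0 - r)
    return total / len(targets)

# One two-pixel "image": a close reconstruction versus a poor one.
good = reconstruction_cross_entropy([[1.0, 0.0]], [[0.9, 0.1]])
bad = reconstruction_cross_entropy([[1.0, 0.0]], [[0.1, 0.9]])
print(good, bad)
```

A reconstruction that assigns high probability to the pixels that are actually on yields the smaller value.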

Note that we enforce the weights being equal by copying over the transposed value of the updated (recognition) weights into the generative weights. Keeping the two sets of weights separate will be useful later on when we perform updates only on the recognition (forward) or generative (backward) weights during the wake-sleep procedure.

Putting it all together, we can initialize an RBM with 500 units like in Hinton's paper [24], call build() with the shape of the flattened MNIST digits, and run successive epochs of training:

from functools import partial

rbm = RBM(500)
rbm.build([784])
num_epochs=100

# mnist_train and flatten_image are defined earlier in the chapter
def train_rbm(rbm=None, data=mnist_train, map_fn=flatten_image, num_epochs=1000, tolerance=1e-3, batch_size=32, shuffle_buffer=1024):
    last_cost = None
    for epoch in range(num_epochs):
        cost = 0.0
        count = 0.0
        for datapoints in data.map(map_fn).shuffle(shuffle_buffer).batch(batch_size):
            cost += rbm.cd_update(datapoints)
            count += 1.0
        cost /= count
        print("epoch: {}, cost: {}".format(epoch, cost))
        # stop early once the cost changes by no more than the tolerance
        if last_cost and abs(last_cost-cost) <= tolerance:
            break
        last_cost = cost
    return rbm

rbm = train_rbm(rbm, mnist_train, partial(flatten_image, label=False), 100, 0.5, 2000)

After about 25 steps, the model should converge, and we can inspect the results. One parameter of interest is the weight matrix w; the shape is 784 (28*28) by 500, so we could see each "column" as a 28*28 filter, similar to the kernels in the convolutional networks we studied in Chapter 3, Building Blocks of Deep Neural Networks. We can visualize a few of these to see what kinds of patterns they are recognizing in the images:

import matplotlib.pyplot as plt

fig, axarr = plt.subplots(10,10)
plt.axis('off')

for i in range(10):
    for j in range(10):
        fig.axes[i*10+j].get_xaxis().set_visible(False)
        fig.axes[i*10+j].get_yaxis().set_visible(False)
        # each column of w_rec is shown as a 28*28 filter
        axarr[i,j].imshow(rbm.w_rec.numpy()[:, i*10+j].reshape(28,28), cmap=plt.get_cmap("gray"))

This provides a set of filters:

We can see that these filters appear to represent different shapes that we would find in a digit image, such as curves or lines. We can also observe the reconstruction of the images by sampling from our data:

import numpy as np

i=0
for image, label in mnist_train.map(flatten_image).batch(1).take(10):
    # sampled reconstruction of the digit
    plt.figure(i)
    plt.imshow(rbm.forward_gibbs(image).numpy().reshape(28,28).astype(np.float32), cmap=plt.get_cmap("gray"))
    i+=1
    # original digit for comparison
    plt.figure(i)
    plt.imshow(image.numpy().reshape(28,28).astype(np.float32), cmap=plt.get_cmap("gray"))
    i+=1

We can see in Figure 4.11 that the network has nicely captured the underlying data distribution, as our samples represent a recognizable binary form of the input images. Now that we have one layer working, let's continue by combining multiple RBMs in layers to create a more powerful model.




    

        


Tuesday, March 8, 2022

Stacking Restricted Boltzmann Machines to generate images: the Deep Belief Network

You have seen that an RBM with a single hidden layer can be used to learn a generative model of images; in fact, theoretical work has suggested that with a sufficiently large number of hidden units, an RBM can approximate any distribution with binary values. However, in practice, for very large input data, it may be more efficient to add additional layers, instead of a single large layer, allowing a more "compact" representation of the data.

Researchers who developed DBNs also noted that adding additional layers can only improve (that is, never worsen) the lower bound on the log likelihood of the data under the generative model. In this case, the hidden layer output h of the first layer becomes the input to a second RBM; we can keep adding other layers to make a deeper network. Furthermore, if we wanted to make this network capable of learning not only the distribution of the image (x) but also the label (y), that is, which digit from 0 to 9 it represents, we could add yet another layer to a stack of connected RBMs that is a probability distribution (softmax) over the 10 possible digit classes.

A problem with training a very deep graphical model such as stacked RBMs is the "explaining-away effect" that we discussed in Chapter 3, Building Blocks of Deep Neural Networks. Recall that the dependency between variables can complicate inference of the state of hidden variables:

In Figure 4.8, the knowledge that the pavement is wet can be explained by a sprinkler being turned on, to the extent that the presence or absence of rain becomes irrelevant, meaning we can't meaningfully infer the probability that it is raining. This is equivalent to saying that the posterior distribution (Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models) of the hidden units cannot be tractably computed, since they are correlated, which interferes with easily sampling the hidden state of the RBM.

One solution is to treat each of the units as independent in the likelihood function, which is known as variational inference; while this works in practice, it is not a satisfying solution given that we know that these units are in fact correlated.

But where does this correlation come from? If we sample the state of the visible units in a single-layer RBM, we set the states of each hidden unit randomly since they are independent; thus the prior distribution over the hidden units is independent. Why is the posterior then correlated? Just as the knowledge (data) that the pavement is wet causes a correlation between the probabilities of a sprinkler and rainy weather, the correlation between pixel values causes the posterior distribution of the hidden units to be non-independent. This is because the pixels in the images aren't set randomly; based on which digit the image represents, groups of pixels are more or less likely to be bright or dark. In the 2006 paper A Fast Learning Algorithm for Deep Belief Nets, the authors hypothesized that this problem could be solved by computing a complementary prior that has exactly the opposite correlation to the likelihood, thus canceling out this dependence and making the posterior also independent.

To compute this complementary prior, we could use the posterior distribution over hidden units in a higher layer. The trick to generating such distributions is in a greedy, layer-wise procedure for "priming" the network of stacked RBMs in a multilayer generative model, such that the weights can then be fine-tuned as a classification model. For example, let's consider a three-layer model for the MNIST data (Figure 4.9):

The two 500-unit layers form representations of the MNIST digits, while the 2000- and 10-unit layers are "associative memory" that correlates labels with the digit representation. The first two layers have directed connections (different weights) for upsampling and downsampling, while the top layers have undirected weights (the same weight for forward and backward passes).

This model could be learned in stages. For the first 500-unit RBM, we would treat it as an undirected model by enforcing that the forward and backward weights are equal; we would then use CD to learn the parameters of this RBM. We would then fix these weights and learn a second (500-unit) RBM that uses the hidden units from the first layer as input "data," and repeat for the 2000-unit layer.

After we have "primed" the network, we no longer need to enforce that the weights in the bottom layers are tied, and can fine-tune the weights using an algorithm known as "wake-sleep."

Firstly, we take input data (the digits) and compute the activations of the other layers all the way up until the connections between the 2000- and 10-unit layers. We compute updates to the "generative weights" (those that compute the activations that yield image data from the network) pointing downward using the previously given gradient equations. This is the "wake" phase because if we consider the network as resembling a biological sensory system, then it receives input from the environment through this forward pass.

For the 2000- and 10-unit layers, we use the sampling procedure for CD using the second 500-unit layer's output as "data" to update the undirected weights.

We then take the output of the 2000-unit layer and compute activations downward, updating the "recognition weights" (those that compute activations that lead to the classification of the image into one of the digit classes) pointing upward. This is called the "sleep" phase because it displays what is in the "memory" of the network, rather than taking data from outside.

We then repeat these steps until convergence.

Note that in practice, instead of using undirected weights in the top layers of the network, we could replace the last layer with directed connections and a softmax classifier. This network would then technically no longer be a DBN, but rather a regular Deep Neural Network that we could optimize with backpropagation. This is an approach we will take in our own code, as we can then leverage TensorFlow's built-in gradient calculations, and it fits into the paradigm of the Model API.

Now that we have covered the theoretical background to understand how a DBN is trained and how the pre-training approach resolves issues with the "explaining-away" effect, we will implement the whole model in code, showing how we can leverage TensorFlow 2's gradient tape functionality to implement CD as a custom learning algorithm.



Saturday, March 5, 2022

Contrastive divergence: Approximating a gradient

If we refer back to Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, creating a generative model of images using an RBM essentially involves finding the probability distribution of images, using the energy equation:

where x is an image, theta is the parameters of the model (the weights and biases), and Z is the partition function.

In order to find the parameters that optimize this distribution, we need to maximize the likelihood (the product of each datapoint's probability under a density function) based on data:

In practice, it's a bit easier to use the negative log likelihood, as this is represented by a sum:

If the distribution f has a simple form, then we can just take the derivative of E with respect to the parameters of f. For example, if f is a single normal distribution, then the values that minimize E (equivalently, maximize the likelihood) with respect to mu (the mean) and sigma (the standard deviation) are, respectively, the sample mean and standard deviation; the partition function Z doesn't affect this calculation because the integral is 1, a constant, which becomes 0 once we take the logarithm.

If the distribution is instead a sum of N normal distributions, then the partial derivative of f (the sum of all the N normal distributions) with respect to mu(i) (the mean of one of these distributions) involves the mu and sigma of each other distribution as well. Because of this dependence, there is no closed-form solution (that is, a solution equation we can write out by rearranging terms or applying algebraic transformations) for the optimal value; instead, we need to use a gradient search method (such as the backpropagation algorithm we discussed in Chapter 3, Building Blocks of Deep Neural Networks) to iteratively find the optimal value of this function. Again, the integral of each of these N distributions is 1, meaning the log of the partition function is the constant log(N), whose derivative is 0.

What happens if the distribution f is a product, instead of a sum, of normal distributions? The partition function Z is now no longer a constant in this equation with respect to theta, the parameters; the value will depend on how and where these functions overlap when computing the integral: they could cancel each other out by being mutually exclusive (0) or overlapping (yielding a value greater than 1). In order to evaluate gradient descent steps, we would need to be able to compute this partition function using numerical methods. In the RBM example, this partition function for the configuration of 28 * 28 MNIST digits would have 784 logistic units, and a massive number (2^784) of possible configurations, making it unwieldy to evaluate every time we want to take a gradient.
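To make the scale of this problem concrete, the sketch below evaluates a partition function by brute force for a tiny binary RBM and then reports how many decimal digits the number of configurations has for 784 binary units. It assumes the standard RBM energy E(v, h) = -b.v - c.h - v.W.h, which is not written out in this section, and it sums over joint visible and hidden states; partition_function and its argument names are illustrative only.

```python
import itertools
import math

def partition_function(weights, visible_bias, hidden_bias):
    """Sum exp(-energy) over every joint binary state of a tiny RBM."""
    n_visible, n_hidden = len(visible_bias), len(hidden_bias)
    total = 0.0
    for v in itertools.product([0, 1], repeat=n_visible):
        for h in itertools.product([0, 1], repeat=n_hidden):
            energy = -sum(visible_bias[i] * v[i] for i in range(n_visible))
            energy -= sum(hidden_bias[j] * h[j] for j in range(n_hidden))
            energy -= sum(v[i] * weights[i][j] * h[j]
                          for i in range(n_visible) for j in range(n_hidden))
            total += math.exp(-energy)
    return total

# With all-zero parameters every state has energy 0, so Z equals the state count.
zero_weights = [[0.0, 0.0] for _ in range(3)]  # 3 visible units, 2 hidden units
z_zero = partition_function(zero_weights, [0.0] * 3, [0.0] * 2)
print(z_zero)

# The same enumeration for 784 binary units would need this many terms:
config_digits = len(str(2 ** 784))
print(config_digits, "decimal digits")
```

Five units already require 32 terms; each additional unit doubles the work, which is why exact evaluation is out of reach at MNIST scale.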

Is there any other way we could optimize the value of this energy equation without taking a full gradient? Returning to the energy equation, let's write out the gradient explicitly:

The partition function Z can be further written as a function of the integral involving X and the parameters of f;

where <> represents an average over the observed data sampled from the distribution of x. In order words, we can approximate the integral by sampling from the data and computing the average, which allows us to avoid computing or approximating high-dimensional integrals.

While we can't directly sample from p(x), we can use a technique known as Markov Chain Monte Carlo (MCMC) sampling to generate data from the target distribution p(x). As was described in our discussion on Hopfield networks, the "Markov" property means that this sampling only uses the last sample as input in determining the probability of the next datapoint in the simulation; this forms a "chain" in which each successive sampled datapoint becomes input to the next.
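The chain structure can be sketched with random-walk Metropolis, one common MCMC method that is swapped in here purely for illustration (it is not the sampler used for RBMs later in this section). The target unnormalized_density is a hypothetical stand-in: a standard normal with its normalizing constant deliberately left out.

```python
import math
import random


def unnormalized_density(x):
    # Hypothetical target: a standard normal without its normalizing constant.
    return math.exp(-0.5 * x * x)


def metropolis_chain(f, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis: each proposal depends only on the current sample."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        # The acceptance ratio uses f(proposal) / f(x), so Z cancels out.
        if rng.random() < f(proposal) / f(x):
            x = proposal
        samples.append(x)
    return samples


chain = metropolis_chain(unnormalized_density, x0=0.0, n_steps=50_000)
chain_mean = sum(chain) / len(chain)
chain_var = sum((s - chain_mean) ** 2 for s in chain) / len(chain)
```

Note that the sampler never needs the partition function, only the ratio of unnormalized values at two points, and that each new sample is generated from the previous one alone.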

The "Monte Carlo" in the name of this technique is a reference to a casino in the principality of Monaco, and denotes that, like the outcomes of gambling, these samples are generated through a random process. By generating these random samples, you can use N MCMC steps as an approximation of the average of a distribution that is otherwise difficult or impossible to integrate. When we put all of this together, we get the following gradient equation:
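Using \Theta for the parameters and f for the unnormalized distribution, and writing \langle \cdot \rangle_{X^n} for an average over the samples after n MCMC steps (notation assumed here), this approximate gradient can be sketched as:

```latex
\frac{\partial E(X;\Theta)}{\partial \Theta}
  \approx \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^{N}}
  - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^{0}}
```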

where X^n represents the data at step n in the MCMC chain, with X^0 being the input data. While in theory you might think it would take a large number of steps for the chain to converge, in practice it has been observed that even N=1 step is enough to get a decent gradient approximation.

Notice that the end result is a contrast between the input data and the sampled data; thus, the method is named contrastive divergence as it involves the difference between two distributions.

Applying this to our RBM example, we can follow this recipe to generate the required samples:

1. Take an input vector v

2. Compute a "hidden" activation h

3. Use the activation from step 2 to generate a sampled visible state v'

4. Use the sampled visible state from step 3 to generate a sampled hidden state h'

5. Compute the updates, which are simply the correlations of the visible and hidden units:
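In the standard CD-1 formulation, writing v' and h' for the sampled visible and hidden states from steps 3 and 4 and W for the weight matrix (W and the prime notation are assumptions here), these updates are commonly written as:

```latex
\Delta W = e \left( v h^{\top} - v' {h'}^{\top} \right), \qquad
\Delta b = e \left( v - v' \right), \qquad
\Delta c = e \left( h - h' \right)
```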


where b and c are the bias terms of visible and hidden units, respectively, and e is the learning rate.
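The recipe above can be sketched in plain Python for a tiny RBM. This is a minimal illustration under stated assumptions, not the chapter's implementation: the weight matrix W, the logistic (sigmoid) activation, the use of sampled binary hidden states in the update, and the layer sizes are all choices made for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, c, rng):
    """Sample each hidden unit given the visible vector v."""
    probs = [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(c))]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, b, rng):
    """Sample each visible unit given the hidden vector h."""
    probs = [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(v, W, b, c, e, rng):
    """One CD-1 step, updating W, b, and c in place."""
    h = sample_hidden(v, W, c, rng)        # step 2: hidden activation
    v_s = sample_visible(h, W, b, rng)     # step 3: sampled visible state
    h_s = sample_hidden(v_s, W, c, rng)    # step 4: sampled hidden state
    for i in range(len(b)):                # step 5: correlation differences
        for j in range(len(c)):
            W[i][j] += e * (v[i] * h[j] - v_s[i] * h_s[j])
        b[i] += e * (v[i] - v_s[i])
    for j in range(len(c)):
        c[j] += e * (h[j] - h_s[j])
    return v_s


rng = random.Random(0)
n_visible, n_hidden = 6, 3  # tiny hypothetical sizes; MNIST would use 784 visible units
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
v = [1, 1, 1, 0, 0, 0]
for _ in range(200):
    last_sample = cd1_update(v, W, b, c, e=0.1, rng=rng)
```

Each update is the difference between a correlation measured on the input data and the same correlation measured on the sampled data, which is the contrast the method is named for.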

This sampling is known as Gibbs sampling, a method in which we sample one unknown parameter of a distribution at a time while holding all others constant. Here we hold either the visible or the hidden units fixed and sample the other set in each step. With CD, we now have a way to perform gradient descent to learn the parameters of our RBM model; as it turns out, we can potentially compute an even better model by stacking RBMs in what is called a Deep Belief Network (DBN).






Modeling data with uncertainty with Restricted Boltzmann Machines

What other kinds of distributions might we be interested in? While useful from a theoretical perspective, one of the shortcomings of the Hopfield network is that it can't incorporate the kinds of uncertainty seen in actual physical or biological systems. Rather than deterministically turning on or off, real-world systems often involve an element of chance: a magnet might flip polarity, or a neuron might fire at random.

This uncertainty, or stochasticity, is reflected in the Boltzmann machine, a variant of the Hopfield network in which half the neurons (the "visible" units) receive information from the environment, while the other half (the "hidden" units) only receive information from the visible units.

The Boltzmann machine randomly turns each neuron on (1) or off (0) by sampling, and over many iterations converges to a stable state represented by the minima of the energy function. This is shown schematically in Figure 4.6, in which the white nodes of the network are "off" and the blue ones are "on"; if we were to simulate the activations in the network, these values would fluctuate over time.
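A minimal sketch of this stochastic on/off behavior for a single unit follows. It assumes the unit turns on with a logistic (sigmoid) probability of its total input, which is the usual convention for Boltzmann machines; the function name sample_unit is hypothetical.

```python
import math
import random


def sample_unit(total_input, rng):
    """Turn a unit on (1) with probability sigmoid(total_input), else off (0)."""
    p_on = 1.0 / (1.0 + math.exp(-total_input))
    return 1 if rng.random() < p_on else 0


rng = random.Random(0)
# With zero input the unit is on about half the time, so its state keeps fluctuating.
states = [sample_unit(0.0, rng) for _ in range(10_000)]
fraction_on = sum(states) / len(states)
```

Unlike a deterministic Hopfield unit, the same input can produce a different state on each draw; only the long-run frequency is fixed.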

In theory, a model like this could be used, for example, to model the distribution of images such as the MNIST data, using the hidden nodes as a "barcode" that represents an underlying probability model for activating each pixel in the image. In practice, though, there are problems with this approach. Firstly, as the number of units in the Boltzmann network increases, the number of connections grows rapidly and the number of potential configurations that has to be accounted for in the Gibbs measure's normalization constant explodes exponentially, as does the time needed to sample the network to an equilibrium state. Secondly, weights for units with intermediate activation probabilities (not strongly 0 or 1) will tend to fluctuate in a random walk pattern (that is, the probabilities will increase or decrease randomly but never stabilize to an equilibrium value) until the neurons converge, which also prolongs training.

A practical modification is to remove some of the connections in the Boltzmann machine, namely those between visible units and between hidden units, leaving only connections between the two types of neurons. This modification is known as the RBM, shown in Figure 4.7.
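To make the effect of this restriction concrete, the following sketch counts pairwise connections in a fully connected Boltzmann machine versus an RBM. The 784 visible units match the MNIST pixel count mentioned in this chapter, while the 500 hidden units and the function names are hypothetical choices for illustration.

```python
def full_boltzmann_connections(n_units):
    """Pairwise connections when every unit connects to every other unit."""
    return n_units * (n_units - 1) // 2


def rbm_connections(n_visible, n_hidden):
    """Connections when only visible-hidden pairs are linked."""
    return n_visible * n_hidden


n_visible, n_hidden = 784, 500  # 500 hidden units is a hypothetical choice
full = full_boltzmann_connections(n_visible + n_hidden)
restricted = rbm_connections(n_visible, n_hidden)
removed = full - restricted  # visible-visible plus hidden-hidden connections
```

More important than the raw count is the structure: with no connections inside a layer, all units in one layer can be sampled independently once the other layer is fixed.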

Imagine, as described earlier, that the visible units are input pixels from the MNIST dataset, and the hidden units are an encoded representation of that image. By sampling back and forth to convergence, we could create a generative model for images. We would just need a learning rule that would tell us how to update the weights to allow the energy function to converge to its minimum; this algorithm is contrastive divergence (CD). To understand why we need a special algorithm for RBMs, it helps to revisit the energy equation and how we might sample to get equilibrium for the network.