Limit(0)

Saturday, March 5, 2022

Contrastive divergence: Approximating a gradient

If we refer back to Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, creating a generative model of images using an RBM essentially involves finding the probability distribution of images, using the energy equation:

where x is an image, theta is the parameters of the model (the weights and biases), and Z is the partition function:

In order to find the parameters that optimize this distribution, we need to maximize the likelihood (the product of each datapoint's probability under a density function) based on data:

In practice, it's a bit easier to use the negative log likelihood, as this is represented by a sum:
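One way to write the quantities described above, consistent with the surrounding definitions, is sketched below. The symbols K (the number of datapoints), x_k (the k-th datapoint), and the 1/K normalization of the negative log likelihood are notational assumptions made here; f denotes the unnormalized density.

```latex
p(x;\theta) = \frac{f(x;\theta)}{Z(\theta)}, \qquad Z(\theta) = \int f(x;\theta)\,dx

p(X;\theta) = \prod_{k=1}^{K} \frac{f(x_k;\theta)}{Z(\theta)}

E(X;\theta) = -\frac{1}{K}\log p(X;\theta) = \log Z(\theta) - \frac{1}{K}\sum_{k=1}^{K} \log f(x_k;\theta)
```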

If the distribution f has a simple form, then we can just take the derivative of E with respect to the parameters of f. For example, if f is a single normal distribution, then the values that minimize E (that is, maximize the likelihood) with respect to mu (the mean) and sigma (the standard deviation) are, respectively, the sample mean and standard deviation; the partition function Z doesn't affect this calculation because the integral is 1, a constant, which becomes 0 once we take the logarithm.
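The single-normal case can be checked numerically. The following is a minimal sketch on synthetic data (the true mean 3.0, the true standard deviation 2.0, and the helper name normal_nll are all assumptions made for illustration): it evaluates the average negative log likelihood at the sample mean and standard deviation, so that it can be compared with nearby parameter values.

```python
import numpy as np


def normal_nll(x, mu, sigma):
    """Average negative log likelihood of x under a normal(mu, sigma)."""
    return np.mean(
        0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2)
    )


# Synthetic data (hypothetical parameters chosen for illustration)
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)

# Closed-form maximum-likelihood estimates
mu_hat = x.mean()
sigma_hat = x.std()  # ddof=0, the maximum-likelihood version

best = normal_nll(x, mu_hat, sigma_hat)
# Perturbing either parameter should not give a lower value
shifted = normal_nll(x, mu_hat + 0.1, sigma_hat)
widened = normal_nll(x, mu_hat, sigma_hat * 1.1)
print(best, shifted, widened)
```

Because the minimum is available in closed form here, no gradient search is needed; the comparison only confirms that moving away from the sample statistics makes the fit worse.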

If the distribution is instead a sum of N normal distributions, then the partial derivative of f (the sum of all N normal distributions) with respect to mu(i) (the mean of one of these distributions) involves the mu and sigma of every other distribution as well. Because of this dependence, there is no closed-form solution (that is, a solution equation we can write out by rearranging terms or applying algebraic transformations) for the optimal value; instead, we need to use a gradient search method (such as the backpropagation algorithm we discussed in Chapter 3, Building Blocks of Deep Neural Networks) to iteratively find the optimal value of this function. Again, the integral of each of these N distributions is 1, meaning the log of the partition function is the constant log(N), making its derivative 0.

What happens if the distribution f is a product, instead of a sum, of normal distributions? The partition function Z is now no longer a constant in this equation with respect to theta, the parameters; the value will depend on how and where these functions overlap when computing the integral. They could cancel each other out by being mutually exclusive (0) or overlap (yielding a value greater than 1). In order to evaluate gradient descent steps, we would need to be able to compute this partition function using numerical methods. In the RBM example, this partition function for the configuration of 28 * 28 MNIST digits would have 784 logistic units, and a massive number (2^784) of possible configurations, making it unwieldy to evaluate every time we want to take a gradient.
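To get a feel for the size of that number, the count of binary configurations can be computed exactly with Python's arbitrary-precision integers. This is only an illustrative sketch; the variable names are hypothetical.

```python
# Number of binary units for a 28 * 28 image
n_units = 28 * 28

# Each unit is either on or off, so there are 2^n_units configurations
n_configs = 2 ** n_units

# Count the decimal digits of that number
n_digits = len(str(n_configs))
print(n_units, n_digits)
```

The result has well over 200 decimal digits, which is why summing over every configuration to obtain Z is not practical.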

Is there any other way we could optimize the value of this energy equation without taking a full gradient? Returning to the energy equation, let's write out the gradient explicitly:

The partition function Z can be further written as a function of the integral involving X and the parameters of f:

where <> represents an average over the observed data sampled from the distribution of x. In other words, we can approximate the integral by sampling from the data and computing the average, which allows us to avoid computing or approximating high-dimensional integrals.
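The idea of replacing an integral with a sample average can be illustrated on a case where the answer is known. The sketch below is an assumption-laden toy example, not the RBM itself: it estimates the expected value of x^2 under a standard normal distribution, whose exact value is 1, by averaging over random samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from the distribution we want to average over
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# The integral of x^2 * p(x) dx is approximated by the sample mean of x^2
mc_estimate = np.mean(samples**2)
print(mc_estimate)
```

The estimate gets closer to the exact value as the number of samples grows, without ever evaluating the integral directly.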

While we can't directly sample from p(x), we can use a technique known as Markov Chain Monte Carlo (MCMC) sampling to generate data from the target distribution p(x). As was described in our discussion on Hopfield networks, the "Markov" property means that this sampling only uses the last sample as input in determining the probability of the next datapoint in the simulation; this forms a "chain" in which each successive sampled datapoint becomes input to the next.

The "Monte Carlo" in the name of this technique is a reference to a casino in the principality of Monaco, and denotes that, like the outcomes of gambling, these samples are generated through a random process. By generating these random samples, you can use N MCMC steps as an approximation of the average of a distribution that is otherwise difficult or impossible to integrate. When we put all of this together, we get the following gradient equation:

where X^n represents the data at step n of the MCMC chain, with X^0 being the input data. While in theory you might think it would take a large number of steps for the chain to converge, in practice it has been observed that even N=1 step is enough to get a decent gradient approximation.
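The chain of reasoning in this section can be summarized in symbols as follows. This is a sketch under stated assumptions: f is the unnormalized density, E is taken to be the average negative log likelihood, angle brackets with a subscript denote an average over that set or distribution, and X^0 and X^N are the data before and after N MCMC steps.

```latex
\frac{\partial E(X;\theta)}{\partial \theta}
  = \frac{\partial \log Z(\theta)}{\partial \theta}
  - \left\langle \frac{\partial \log f(x;\theta)}{\partial \theta} \right\rangle_{X}

\frac{\partial \log Z(\theta)}{\partial \theta}
  = \frac{1}{Z(\theta)} \int \frac{\partial f(x;\theta)}{\partial \theta}\,dx
  = \int p(x;\theta)\, \frac{\partial \log f(x;\theta)}{\partial \theta}\,dx
  = \left\langle \frac{\partial \log f(x;\theta)}{\partial \theta} \right\rangle_{p(x;\theta)}

\frac{\partial E(X;\theta)}{\partial \theta}
  \approx \left\langle \frac{\partial \log f(x;\theta)}{\partial \theta} \right\rangle_{X^{N}}
  - \left\langle \frac{\partial \log f(x;\theta)}{\partial \theta} \right\rangle_{X^{0}}
```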

Notice that the end result is a contrast between the input data and the sampled data; thus, the method is named contrastive divergence as it involves the difference between two distributions.

Applying this to our RBM example, we can follow this recipe to generate the required samples:

1. Take an input vector v

2. Compute a "hidden" activation h

3. Use the activation from (2) to generate a sampled visible state v'

4. Use (3) to generate a sampled hidden state h'

5. Compute the updates, which are simply the correlations of the visible and hidden units:

where b and c are the bias terms of visible and hidden units, respectively, and e is the learning rate.
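The recipe above can be sketched in NumPy as follows. This is a minimal illustration rather than a definitive implementation: it assumes binary units with logistic (sigmoid) activations and the commonly used CD-1 form of the updates, in which the change in W is e times (v h^T - v' h'^T), the change in b is e times (v - v'), and the change in c is e times (h - h'). The function name cd1_update, the array shapes, and the use of hidden probabilities instead of hidden samples in the update statistics are assumptions made here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cd1_update(v0, W, b, c, lr=0.1, rng=None):
    """One contrastive divergence (N=1) step for a binary RBM.

    v0: input vector, shape (n_visible,)
    W:  weights, shape (n_visible, n_hidden)
    b:  visible biases, c: hidden biases
    """
    rng = np.random.default_rng() if rng is None else rng

    # Step 2: hidden activation probabilities, then a binary sample
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(np.float64)

    # Step 3: sampled visible state v' from the hidden sample
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(np.float64)

    # Step 4: hidden activation h' from the sampled visible state
    ph1 = sigmoid(v1 @ W + c)

    # Step 5: difference of visible-hidden correlations (data minus sample)
    dW = lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    db = lr * (v0 - v1)
    dc = lr * (ph0 - ph1)
    return W + dW, b + db, c + dc


# Usage on a tiny hypothetical RBM with 6 visible and 3 hidden units
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.01, size=(6, 3))
b = np.zeros(6)
c = np.zeros(3)
v0 = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
W_new, b_new, c_new = cd1_update(v0, W, b, c, lr=0.1, rng=rng)
```

In a full training loop this step would be repeated over many input vectors, usually in mini-batches.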

This sampling is known as Gibbs sampling, a method in which we sample one unknown parameter of a distribution at a time while holding all others constant. Here we hold either the visible or the hidden units fixed and sample the other set in each step. With CD, we now have a way to perform gradient descent to learn the parameters of our RBM model; as it turns out, we can potentially compute an even better model by stacking RBMs in what is called a Deep Belief Network (DBN).

Modeling data with uncertainty with Restricted Boltzmann Machines

What other kinds of distributions might we be interested in? While useful from a theoretical perspective, one of the shortcomings of the Hopfield network is that it can't incorporate the kinds of uncertainty seen in actual physical or biological systems. Rather than deterministically turning on or off, real-world systems often involve an element of chance: a magnet might flip polarity, or a neuron might fire at random.

This uncertainty, or stochasticity, is reflected in the Boltzmann machine, a variant of the Hopfield network in which half the neurons (the "visible" units) receive information from the environment, while half (the "hidden" units) only receive information from the visible units.

The Boltzmann machine randomly turns on (1) or off (0) each neuron by sampling, and over many iterations converges to a stable state represented by the minima of the energy function. This is shown schematically in Figure 4.6, in which the white nodes of the network are "off" and the blue ones are "on"; if we were to simulate the activations in the network, these values would fluctuate over time.

In theory, a model like this could be used, for example, to model the distribution of images, such as the MNIST data, using the hidden nodes as a "barcode" that represents an underlying probability model for "activating" each pixel in the image. In practice, though, there are problems with this approach. Firstly, as the number of units in the Boltzmann network increases, the number of connections grows rapidly and the number of potential configurations that has to be accounted for in the Gibbs measure's normalization constant grows exponentially, as does the time needed to sample the network to an equilibrium state. Secondly, weights for units with intermediate activation probabilities (not strongly 0 or 1) will tend to fluctuate in a random walk pattern (that is, the probabilities will increase or decrease randomly without settling at an equilibrium value) until the neurons converge, which also prolongs training.

A practical modification is to remove some of the connections in the Boltzmann machine, namely those between visible units and between hidden units, leaving only connections between the two types of neurons. This modification is known as the RBM, shown in Figure 4.7.

Imagine, as described earlier, that the visible units are input pixels from the MNIST dataset, and the hidden units are an encoded representation of that image. By sampling back and forth to convergence, we could create a generative model for images. We would just need a learning rule that would tell us how to update the weights to allow the energy function to converge to its minimum; this algorithm is contrastive divergence (CD). To understand why we need a special algorithm for RBMs, it helps to revisit the energy equation and how we might sample to get equilibrium for the network.

Friday, March 4, 2022

Hopfield networks and energy equations for neural networks

As we discussed in Chapter 3, Building Blocks of Deep Neural Networks, Hebbian Learning states, "Neurons that fire together, wire together," and many models, including the multi-layer perceptron, made use of this idea in order to develop learning rules. One of these models was the Hopfield network, developed in the 1970s and 1980s by several researchers. In this network, each "neuron" is connected to every other by a symmetric weight, with no self-connections (there are only connections between neurons, no self-loops).

Unlike the multi-layer perceptrons and other architectures we studied in Chapter 3, Building Blocks of Deep Neural Networks, the Hopfield network is an undirected graph, since the edges go "both ways."

The neurons in the Hopfield network take on binary values, either (-1, 1) or (0, 1), as a thresholded version of the tanh or sigmoidal activation function:

The threshold values (sigma) never change during training; to update the weights, a "Hebbian" approach is to use a set of n binary patterns (configurations of all the neurons) and update as:

where n is the number of patterns, and e is the binary activation of neurons i and j in a particular configuration. Looking at this equation, you can see that if the neurons share a configuration, the connection between them is strengthened, while if they are opposite signs (one neuron has a sign of +1, the other -1), it is weakened. Following this rule to iteratively strengthen or weaken a connection leads the network to converge to a stable configuration that resembles a "memory" for a particular activation of the network, given some input. This represents a model for associative memory in biological organisms, the kind of memory that links unrelated ideas, just as the neurons in the Hopfield network are linked together.
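The Hebbian rule and the resulting "memory" behavior can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions: units take values in (-1, 1), the weight between neurons i and j is the average over patterns of the product of their activations, all thresholds are zero, and the function names hebbian_weights and recall are hypothetical.

```python
import numpy as np


def hebbian_weights(patterns):
    """Average outer product of the stored patterns, with no self-connections."""
    patterns = np.asarray(patterns, dtype=np.float64)
    n_patterns = patterns.shape[0]
    W = patterns.T @ patterns / n_patterns
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def recall(W, state, n_sweeps=5):
    """Update one neuron at a time using a zero threshold."""
    state = np.array(state, dtype=np.float64)
    for _ in range(n_sweeps):
        for i in range(len(state)):
            state[i] = 1.0 if W[i] @ state >= 0 else -1.0
    return state


# Store one pattern, corrupt a single unit, and let the network settle
pattern = np.array([1, -1, 1, -1, 1, -1], dtype=np.float64)
W = hebbian_weights([pattern])
noisy = pattern.copy()
noisy[0] = -noisy[0]
recovered = recall(W, noisy)
print(recovered)
```

Neurons that agree in the stored pattern receive a positive weight and neurons that disagree receive a negative one, so the corrupted unit is pulled back toward the stored configuration.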

Besides representing biological memory, Hopfield networks also have an interesting parallel to electromagnetism. If we consider each neuron as a particle or "charge," we can describe the model in terms of a "free energy" equation that represents how the particles in this system mutually repulse/attract each other and where on the distribution of potential configurations the system lies relative to equilibrium:

where w is the weights between neurons i and j, s is the "states" of those neurons (either 1, "on," or -1, "off"), and sigma is the threshold of each neuron (that is, the value that its total inputs must exceed to set it to "on"). When the Hopfield network is in its final configuration, it also minimizes the value of the energy function computed for the network, which is lowered by units with an identical state (s) being connected strongly (w). The probability associated with a particular configuration is given by the Gibbs measure:

Here, Z(B) is a normalizing constant that represents all possible configurations of the network, in the same respect as the normalizing constant in the Bayesian probability function you saw in Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models.
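For a network small enough to enumerate, the energy and the Gibbs measure can be computed by brute force. The sketch below rests on assumptions that are not spelled out in the text: the energy takes the common form of minus one half of the sum of w_ij s_i s_j plus the sum of threshold_i s_i, B is read as an inverse temperature called beta, and the names energy and gibbs_measure are hypothetical.

```python
import itertools

import numpy as np


def energy(W, s, theta):
    """Energy of one configuration s under weights W and thresholds theta."""
    s = np.asarray(s, dtype=np.float64)
    return -0.5 * (s @ W @ s) + theta @ s


def gibbs_measure(W, theta, beta=1.0):
    """Probability of every (-1, 1) configuration, by explicit enumeration."""
    n = W.shape[0]
    states = [
        np.array(config, dtype=np.float64)
        for config in itertools.product([-1.0, 1.0], repeat=n)
    ]
    energies = np.array([energy(W, s, theta) for s in states])
    weights = np.exp(-beta * energies)
    Z = weights.sum()  # normalizing constant over all configurations
    return states, weights / Z


# Two neurons joined by a positive weight: aligned states have lower energy
W = np.array([[0.0, 1.0], [1.0, 0.0]])
theta = np.zeros(2)
states, probs = gibbs_measure(W, theta, beta=1.0)
for s, p in zip(states, probs):
    print(s, p)
```

The sum that defines Z runs over 2^n configurations, so this enumeration is only feasible for a handful of neurons.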

Also notice in the energy function definition that the state of a neuron is only affected by local connections (rather than the state of every other neuron in the network, regardless of whether it is connected); this is also known as the Markov property, since the state is "memoryless," depending only on its immediate "past" (neighbors). In fact, the Hammersley-Clifford theorem states that any distribution having this same memoryless property can be represented using the Gibbs measure.

Restricted Boltzmann Machines: generating pixels with statistical mechanics

The neural network model that we will apply to the MNIST data has its origins in earlier research on how neurons in the mammalian brain might work together to transmit signals and encode patterns as memories. By using analogies to statistical mechanics in physics, this section will show you how simple networks can "learn" the distribution of image data and be used as building blocks for larger networks.

Retrieving and loading the MNIST dataset in TensorFlow

The first step in training our own DBN is to construct our dataset. This section will show you how to transform the MNIST data into a convenient format that allows you to train a neural network, using some of TensorFlow 2's built-in functions for simplicity.

Let's start by loading the MNIST dataset in TensorFlow. As the MNIST data has been used for many deep learning benchmarks, TensorFlow 2 already has convenient utilities for loading and formatting this data. To do so, we need to first install the tensorflow-datasets library:

pip install tensorflow-datasets

After installing the package, we need to import it along with the required dependencies:

from __future__ import absolute_import

from __future__ import division

from __future__ import print_function

import matplotlib.pyplot as plt

import numpy as np

import tensorflow.compat.v2 as tf

import tensorflow_datasets as tfds

Now we can download the MNIST data locally from Google Cloud Storage (GCS) using the builder functionality:

mnist_builder = tfds.builder("mnist")

mnist_builder.download_and_prepare()

The dataset will now be available on disk on our machine. As noted earlier, this data is divided into a training and test dataset, which you can verify by taking a look at the info command:

info = mnist_builder.info

print(info)

This gives the following output:

tfds.core.DatasetInfo(

name='mnist',

version=3.0.1,

description='The MNIST database of handwritten digits.',

homepage='http://yann.lecun.com/exdb/mnist/',

features=FeaturesDict({

'image': Image(shape=(28, 28, 1), dtype=tf.uint8),

'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),

}),

total_num_examples=70000,

splits={

'test': 10000,

'train': 60000,

},

supervised_keys=('image', 'label'),

citation="""@article{lecun2010mnist,

title={MNIST handwritten digit database},

author={LeCun, Yann and Cortes, Corinna and Burges, CJ},

journal={ATT Labs [Online], Available: http://yann.lecun.com/exdb/mnist},

volume={2},

year={2010}

}""",

redistribution_info=,

)

As you can see, the test dataset has 10,000 examples, the training dataset has 60,000 examples, and the images are 28*28 pixels with a label from one of 10 classes (0 to 9).

Let's start by taking a look at the training dataset:

mnist_train = mnist_builder.as_dataset(split="train")

We can visually plot some examples using the show_examples function:

fig = tfds.show_examples(mnist_train, info)

This gives the following figure:

You can also see more clearly here the grayscale edges on the numbers where the anti-aliasing was applied to the original dataset to make the edges seem less jagged (the colors have also been flipped from the original example in Figure 4.1).

We can also plot an individual image by taking one element from the dataset, reshaping it to a 28*28 array, casting it as a 32-bit float, and plotting it in grayscale:

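from functools import partial

# flatten_image is the helper defined in the next listing; binding label=True makes map() return (image, label) pairs

flatten_image = partial(flatten_image, label=True)

for image, label in mnist_train.map(flatten_image).take(1):

    # Reshape the flattened pixels back to 28*28 and plot them in grayscale

    plt.imshow(image.numpy().reshape(28,28).astype(np.float32), cmap=plt.get_cmap("gray"))

    print("Label: %d" % label.numpy())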

This gives the following figure:

This is nice for visual inspection, but for our experiments in this chapter, we will actually need to flatten these images into a vector. To do so, we can use the map() function, and verify that the dataset is now flattened; note that we also need to cast to a float for use in the RBM later. The RBM also assumes binary (0 or 1) inputs, so we need to rescale the pixels, which range from 0 to 255, to the range 0 to 1:

def flatten_image(x, label=True):

    # Reshape the 28*28 image to a 1*784 row, cast to float32, and divide by 256.0 so that pixel values 0..255 fall in [0, 1)

    image = tf.divide(tf.dtypes.cast(tf.reshape(x["image"], (1, 28*28)), tf.float32), 256.0)

    if label:

        return (image, x["label"])

    else:

        return image

for image, label in mnist_train.map(flatten_image).take(1):

    # The image is now a single row of 784 values

    plt.imshow(image.numpy().astype(np.float32), cmap=plt.get_cmap("gray"))

    print("Label: %d" % label.numpy())

This gives a 1*784 vector, which is the "flattened" version of the pixels of the digit "4":

Now that we have the MNIST data as a series of vectors, we are ready to start implementing an RBM to process this data and ultimately create a model capable of generating new images.

The MNIST database

In developing the DBN model, we will use a dataset that we have discussed before: the MNIST database, which contains digital images of hand-drawn digits from 0 to 9. This database is a combination of two sets of earlier images from the National Institute of Standards and Technology (NIST): Special Database 1 (digits written by US high school students) and Special Database 3 (written by US Census Bureau employees), the sum of which is split into 60,000 training images and 10,000 test images.

The original images in the dataset were all black and white, while the modified dataset normalized them to fit into a 20*20-pixel bounding box and removed jagged edges using anti-aliasing, leading to intermediary grayscale values in the cleaned images; they are padded for a final resolution of 28*28 pixels.

In the original NIST dataset, all the training images came from bureau employees, while the test dataset came from high school students, and the modified version mixes the two groups in the training and test sets to provide a less biased population for training machine learning algorithms.

An early application of Support Vector Machines (SVMs) to this dataset yielded an error rate of 0.8%, while the latest deep learning models have shown error rates as low as 0.23%. You should note that these figures were obtained due to not only the discrimination algorithms used but also "data augmentation" tricks such as creating additional translated images where the digit has been shifted by several pixels, thus increasing the number of data examples for the algorithm to learn from. Because of its wide availability, this dataset has become a benchmark for many machine learning models, including Deep Neural Networks.

The dataset was also the benchmark for a breakthrough in training multi-layer neural networks in 2006, in which an error rate of 1.25% was achieved (without image translation, as in the preceding examples). In this chapter, we will examine in detail how this breakthrough was achieved using a generative model, and explore how to build our own DBN that can generate MNIST digits.

4. Teaching Networks to Generate Digits

In the previous chapter, we covered the building blocks of neural network models. In this chapter, our first project will recreate one of the most groundbreaking models in the history of deep learning, the Deep Belief Network (DBN). The DBN was one of the first multi-layer networks for which a feasible learning algorithm was developed. Besides being of historical interest, this model is connected to the topic of this book because the learning algorithm makes use of a generative model in order to pre-train the neural network weights into a reasonable configuration prior to backpropagation.

In this chapter, we will cover:

- How to load the Modified National Institute of Standards and Technology (MNIST) dataset and transform it using TensorFlow 2's Dataset API.

- How a Restricted Boltzmann Machine (RBM), a simple neural network, is trained by minimizing an "energy" equation that resembles formulas from physics to generate images.

- How to stack several RBMs to make a DBN and apply forward and backward passes to pre-train this network to generate image data.

- How to implement an end-to-end classifier by combining this pre-training with backpropagation "fine-tuning" using the TensorFlow 2 API.