
Friday, March 4, 2022

Restricted Boltzmann Machines: generating pixels with statistical mechanics

 The neural network model that we will apply to the MNIST data has its origins in earlier research on how neurons in the mammalian brain might work together to transmit signals and encode patterns as memories. By using analogies to statistical mechanics in physics, this section will show you how simple networks can "learn" the distribution of image data and be used as building blocks for larger networks.

Retrieving and loading the MNIST dataset in TensorFlow

 The first step in training our own DBN is to construct our dataset. This section will show you how to transform the MNIST data into a convenient format that allows you to train a neural network, using some of TensorFlow 2's built-in functions for simplicity.

Let's start by loading the MNIST dataset in TensorFlow. As the MNIST data has been used for many deep learning benchmarks, TensorFlow 2 already has convenient utilities for loading and formatting this data. To do so, we need to first install the tensorflow-datasets library:

pip install tensorflow-datasets

After installing the package, we need to import it along with the required dependencies:

from __future__ import absolute_import

from __future__ import division

from __future__ import print_function

import matplotlib.pyplot as plt

import numpy as np

import tensorflow.compat.v2 as tf

import tensorflow_datasets as tfds


Now we can download the MNIST data locally from Google Cloud Storage (GCS) using the builder functionality:

mnist_builder = tfds.builder("mnist")

mnist_builder.download_and_prepare()

The dataset will now be available on disk on our machine. As noted earlier, this data is divided into a training and test dataset, which you can verify by taking a look at the builder's info property:

info = mnist_builder.info

print(info)

This gives the following output:

tfds.core.DatasetInfo(

    name='mnist',

    version=3.0.1,

    description='The MNIST database of handwritten digits.',

    homepage='http://yann.lecun.com/exdb/mnist/',

    features=FeaturesDict({

        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),

        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),

    }),

    total_num_examples=70000,

    splits={

        'test':10000,

        'train':60000,

    },

    supervised_keys=('image', 'label'),

    citation="""@article{lecun2010mnist,

        title={MNIST handwritten digit database},

        author={LeCun, Yann and Cortes, Corinna and Burges, CJ},

        journal={ATT Labs [Online], Available: http://yann.lecun.com/exdb/mnist},

        volume={2},

        year={2010}

        }""",

    redistribution_info=,

)

As you can see, the test dataset has 10,000 examples, the training dataset has 60,000 examples, and the images are 28*28 pixels with a label from one of 10 classes (0 to 9).

Let's start by taking a look at the training dataset:

mnist_train = mnist_builder.as_dataset(split="train")

We can visually plot some examples using the show_examples function:

fig = tfds.show_examples(info, mnist_train)

This gives the following figure:

You can also see more clearly here the grayscale edges on the numbers where the anti-aliasing was applied to the original dataset to make the edges seem less jagged (the colors have also been flipped from the original example in Figure 4.1).

We can also plot an individual image by taking one element from the dataset, reshaping it to a 28*28 array, casting it as a 32-bit float, and plotting it in grayscale:

# Note: this uses the flatten_image helper defined in the next code block
# to unpack each record into an (image, label) pair.
for image, label in mnist_train.map(flatten_image).take(1):

    plt.imshow(image.numpy().reshape(28, 28).astype(np.float32), cmap=plt.get_cmap("gray"))

    print("Label: %d" % label.numpy())

This gives the following figure:

This is nice for visual inspection, but for our experiments in this chapter, we will actually need to flatten these images into a vector. To do so, we can use the map() function and verify that the dataset is now flattened; note that we also need to cast the values to floats for use in the RBM later. The RBM also assumes binary (0 or 1) inputs, so we need to rescale the pixels, which range from 0 to 255, to the range 0 to 1:

def flatten_image(x, label=True):

    if label:

        return (tf.divide(tf.dtypes.cast(tf.reshape(x["image"],(1,28*28)), tf.float32), 256.0), x["label"])

    else:

        return (tf.divide(tf.dtypes.cast(tf.reshape(x["image"],(1,28*28)), tf.float32), 256.0))

for image, label in mnist_train.map(flatten_image).take(1):

    plt.imshow(image.numpy().astype(np.float32), cmap=plt.get_cmap("gray"))

    print("Label: %d" % label.numpy())

This gives a 784-element vector, which is the "flattened" version of the pixels of the digit "4":
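As a quick sanity check (not part of the original text), printing the shape of one mapped element confirms that each image is now a single row vector of 784 values:

for image, label in mnist_train.map(flatten_image).take(1):
    print(image.shape)  # expected: (1, 784)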

Now that we have the MNIST data as a series of vectors, we are ready to start implementing an RBM to process this data and ultimately create a model capable of generating new images.


The MNIST database

 In developing the DBN model, we will use a dataset that we have discussed before - the MNIST database, which contains digital images of hand-drawn digits from 0 to 9. This database is a combination of two sets of earlier images from the National Institute of Standards and Technology (NIST): Special Database 1 (digits written by US high school students) and Special Database 3 (digits written by US Census Bureau employees), which together are split into 60,000 training images and 10,000 test images.

The original images in the dataset were all black and white, while the modified dataset normalized them to fit into a 20*20-pixel bounding box and removed jagged edges using anti-aliasing, leading to intermediate grayscale values in the cleaned images; the digits were then centered and padded to a final resolution of 28*28 pixels.

In the original NIST dataset, all the training images came from bureau employees, while the test dataset came from high school students, and the modified version mixes the two groups in the training and test sets to provide a less biased population for training machine learning algorithms.

An early application of Support Vector Machines (SVMs) to this dataset yielded an error rate of 0.8%, while the latest deep learning models have shown error rates as low as 0.23%. You should note that these figures were obtained not only because of the discriminative algorithms used but also thanks to "data augmentation" tricks, such as creating additional translated images where the digit has been shifted by several pixels, thus increasing the number of data examples for the algorithm to learn from. Because of its wide availability, this dataset has become a benchmark for many machine learning models, including Deep Neural Networks.
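To make the translation trick concrete, here is a minimal sketch (not code from the text) of how a single 28*28 digit could be shifted a few pixels to produce an extra training example; it assumes non-negative shifts and a zero (black) background:

import numpy as np

def shift_digit(image, dx=2, dy=2):
    # Translate a 28x28 digit dx pixels right and dy pixels down,
    # filling the vacated border with the background value (0).
    shifted = np.zeros_like(image)
    h, w = image.shape
    shifted[dy:, dx:] = image[:h - dy, :w - dx]
    return shifted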

The dataset was also the benchmark for a breakthrough in training multi-layer neural networks in 2006, in which an error rate of 1.25% was achieved (without image translation, as in the preceding examples). In this chapter, we will examine in detail how this breakthrough was achieved using a generative model, and explore how to build our own DBN that can generate MNIST digits.

4. Teaching Networks to Generate Digits

 In the previous chapter, we covered the building blocks of neural network models. In this chapter, our first project will recreate one of the most groundbreaking models in the history of deep learning, the Deep Belief Network (DBN). The DBN was one of the first multi-layer networks for which a feasible learning algorithm was developed. Besides being of historical interest, this model is connected to the topic of this book because the learning algorithm makes use of a generative model in order to pre-train the neural network weights into a reasonable configuration prior to backpropagation.

In this chapter, we will cover:

- How to load the Modified National Institute of Standards and Technology (MNIST) dataset and transform it using TensorFlow 2's Dataset API.

- How a Restricted Boltzmann Machine (RBM) - a simple neural network - is trained by minimizing an "energy" equation that resembles formulas from physics in order to generate images.

- How to stack several RBMs to make a DBN and apply forward and backward passes to pre-train this network to generate image data.

- How to implement an end-to-end classifier by combining this pre-training with backpropagation "fine-tuning" using the TensorFlow 2 API.


Tuesday, March 1, 2022

Summary

 In this chapter, we've covered the basic vocabulary of deep learning - how initial research into perceptrons and MLPs led to simple learning rules being abandoned for backpropagation. We also looked at specialized neural network architectures such as CNNs, based on the visual cortex, and recurrent networks, specialized for sequence modeling. Finally, we examined variants of the gradient descent algorithm proposed originally for backpropagation, which have advantages such as momentum, and described weight initialization schemes that place the parameters of the network in a range that is easier to navigate to a local minimum.

With this context in place, we are all set to dive into projects in generative modeling, beginning with the generation of MNIST digits using Deep Belief Networks in Chapter 4, Teaching Networks to Generate Digits.


Xavier initialization

 As noted previously, in earlier research it was common to initialize weights in a neural network with some range of random values. Breakthroughs in the training of Deep Belief Networks in 2006, as you will see in Chapter 4, Teaching Networks to Generate Digits, used pre-training (through a generative modeling approach) to initialize weights before performing standard backpropagation.

If you've ever used a layer in the TensorFlow Keras module, you will notice that the default initialization for layer weights draws from either a truncated normal or uniform distribution. Where does this choice come from? As I described previously, one of the challenges with deep networks using sigmoidal or hyperbolic activation functions is that they tend to become saturated, since the values of these functions are capped for very large positive or negative inputs. We might then interpret the challenge of initializing networks as keeping weights in a range such that they don't saturate the neuron's output. Another way to understand this is to require that the input and output values of the neuron have similar variance; the signal is then neither massively amplified nor diminished while passing through the neuron.

In practice, for a linear neuron, y = wx + b, we could compute the variance of the input and output as:

var(y) = var(wx + b)

Since b is a constant, we are left with (the expectation terms below vanish because w and x are assumed to have zero mean):

var(y) = var(w)var(x) + var(w)E(x)^2 + var(x)E(w)^2 = var(w)var(x)

Since y is a sum over N such terms (one for each element of the weight matrix), and we want var(y) to equal var(x), this gives:

var(y) = Nvar(w)var(x) = var(x), so Nvar(w) = 1 and var(w) = 1/N

Therefore, for a weight matrix w, we can use a truncated normal or uniform distribution with variance 1/N, where N is taken as the average of the number of input and output units. Variations have also been derived for ReLU units; these methods are referred to by their original authors' names as Xavier (Glorot) or He initialization.
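As a brief illustration (a sketch, not code from the text), both schemes are available as built-in Keras initializers in TensorFlow 2; the layer sizes and activations below are arbitrary choices for the example:

import tensorflow as tf

# Glorot (Xavier) initialization is the Keras default for Dense layers;
# He initialization is the variant typically paired with ReLU activations.
xavier_layer = tf.keras.layers.Dense(256, activation="sigmoid", kernel_initializer="glorot_uniform")
he_layer = tf.keras.layers.Dense(256, activation="relu", kernel_initializer="he_normal")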

In summary, we've reviewed several common optimizers used under the hood in TensorFlow 2, and discussed how they improve upon the basic form of SGD. We've also discussed how clever weight initialization schemes work together with these optimizers to allow us to train ever more complex models.


Gradient descent to ADAM

 As we saw in our discussion of backpropagation, the original version proposed in 1986 for training neural networks averaged the loss over the entire dataset before taking the gradient and updating the weights. Obviously, this is quite slow and makes it difficult to distribute the computation, since splitting the input data across model replicas would still require each replica to have access to the whole dataset.

In contrast, SGD computes gradient updates after n samples, where n can range from 1 to N, the size of the dataset. In practice, we usually perform mini-batch gradient descent, in which n is relatively small, and we randomize the assignment of data to batches after each epoch (a single pass through the data).
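As a concrete illustration (a sketch reusing the mnist_train dataset and flatten_image helper loaded earlier in these notes, with an arbitrary batch size), the Dataset API can produce mini-batches whose membership is re-randomized every epoch:

# Shuffle before batching so that the assignment of examples to batches
# is re-randomized on each pass through the data.
mnist_batches = mnist_train.map(flatten_image).shuffle(buffer_size=60000, reshuffle_each_iteration=True).batch(32)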

However, SGD can be slow, leading researchers to propose alternatives that accelerate the search for a minimum. As seen in the original backpropagation algorithm, one idea is to use a form of exponentially weighted momentum that remembers prior steps and continues in promising directions. Variants have been proposed, such as Nesterov momentum, which adds a term to increase this acceleration.
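Written out (a standard formulation, not quoted from the text), with learning rate η, momentum coefficient γ, parameters θ, velocity v, and loss L, the two variants are:

Classical momentum: v_t = γv_(t-1) + η∇L(θ_(t-1)), θ_t = θ_(t-1) - v_t

Nesterov momentum: v_t = γv_(t-1) + η∇L(θ_(t-1) - γv_(t-1)), θ_t = θ_(t-1) - v_t

The only difference is the point at which the gradient is evaluated.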

In comparison to the momentum term used in the original backpropagation algorithm, incorporating the current momentum term into the gradient evaluation helps keep the momentum component aligned with changes in the gradient.

Another optimization, termed Adaptive Gradient (Adagrad), scales the learning rate for each parameter's update by the running sum of squares (G) of that parameter's gradients; thus, parameters that are frequently updated take smaller steps, while those that are infrequently updated are pushed to update with greater magnitude:
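In the same notation as above (a standard form of the update, not quoted from the text), with g_t the current gradient of a parameter, G_t the running sum of its squared gradients, and ε a small constant for numerical stability:

G_t = G_(t-1) + g_t^2, θ_t = θ_(t-1) - (η / sqrt(G_t + ε))g_t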

This approach has the downside that, as we continue to train the neural network, the sum G will increase indefinitely, ultimately shrinking the learning rate to a very small value. To fix this shortcoming, two variant methods, RMSProp (frequently applied to RNNs) and AdaDelta, impose fixed-width windows of n steps in the computation of G.


Adaptive Moment Estimation (ADAM) can be seen as an attempt to combine momentum and AdaDelta; the momentum calculation is used to preserve the history of past gradient updates, while the decaying sum of squared gradients within a fixed update window, as used in AdaDelta, is applied to scale the resulting gradient.
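All of the optimizers discussed in this section are available off the shelf in TensorFlow 2. The following sketch shows how they can be instantiated through tf.keras.optimizers; the hyperparameter values are illustrative defaults, not settings taken from the text:

import tensorflow as tf

# Illustrative instantiation of the optimizers discussed above.
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adadelta = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95)
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)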

The methods mentioned here all share the property of being first order; they involve only the first derivative of the loss with respect to the parameters. While simple to compute, this may introduce practical challenges with navigating the complex solution space of neural network parameters. As shown in Figure 3.17, if we visualize the landscape of weight parameters as a ravine, then first-order methods will either move too quickly in areas where the curvature is changing rapidly (the top image), overshooting the minima, or will change too slowly within the minima "ravine," where the curvature is low. An ideal algorithm would take into account not only the curvature but also the rate of change of the curvature, allowing the optimizer to take larger step sizes when the curvature changes very slowly, and vice versa (the bottom image).

Because they make use of the rate of change of the derivative (the second derivative), these methods are known as second order, and have demonstrated some success in optimizing neural network models.

However, the computation required for each update is larger than for first-order methods, and because most second-order methods involve large matrix inversions (and thus high memory utilization), approximations are required to make these methods scale. Ultimately, however, one of the breakthroughs in practically optimizing networks comes not just from the optimization algorithm, but from how we initialize the weights in the model.