

Building a better optimizer

 In this chapter we have so far discussed several examples in which better neural network architectures allowed for breakthroughs; however, just as (and perhaps even more) important is the optimization procedure used to minimize the error function in these problems, which "learns" the parameters of the network by selecting those that yield the lowest error. Referring to our discussion of backpropagation, this problem has two components:

- How to initialize the weights: Historically, many applications used random weights within some range and hoped that backpropagation would reach at least a local minimum of the loss function from this random starting point.

- How to find the local minimum loss: In basic backpropagation, we used gradient descent with a fixed learning rate and a first-derivative update to traverse the potential solution space of weight matrices; however, there is good reason to believe there might be more efficient ways to find a local minimum.

In fact, both of these have turned out to be key considerations in the progress of deep learning research.
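As a minimal sketch of how these two choices appear in practice (using the tf.keras API; the layer sizes, input shape, and learning rate here are illustrative rather than taken from the text), we can specify both a weight initializer and an optimizer when building and compiling a model:

import tensorflow as tf

# Each Dense layer takes an explicit weight initializer (here Glorot/Xavier),
# and the optimizer (here Adam with a chosen learning rate) controls how the
# gradients computed by backpropagation are turned into weight updates.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="glorot_uniform",
                          input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid",
                          kernel_initializer="glorot_uniform"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy")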

RNNs and LSTMs

 Let's imagine we are trying to predict the next word in a sentence, given the words up until this point. A neural network that attempted to predict the next word would need to take into account not only the current word but a variable number of prior inputs. If we instead used only a simple feedforward MLP, the network would essentially have to process the entire sentence (or each word) as a fixed-length vector. This introduces the problem of either having to pad variable-length inputs to a common length while not preserving any notion of correlation (that is, which words in the sentence are more relevant than others in generating the next prediction), or only using the last word at each step as the input, which removes the context of the rest of the sentence and all the information it can provide. This kind of problem inspired the "vanilla" RNN, which incorporates not only the current input but also the prior step's hidden state in computing a neuron's output:
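In the standard formulation (the weight names here are conventional rather than taken from the original figure), the hidden state at step t combines the current input with the previous hidden state:

$$h_t = \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right)$$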

One way to visualize this is to imagine each layer feeding recursively into the next timestep in a sequence. In effect, if we "unroll" each part of the sequence, we end up with a very deep neural network, where each layer shares the same weights.

The same difficulties that characterize training deep feedforward networks also apply to RNNs; gradients tend to die out over long distances using traditional activation functions (or explode if the gradients become greater than 1).

However, unlike feedforward networks, RNNs aren't trained with traditional backpropagation, but rather a variant known as backpropagation through time (BPTT): the network is unrolled, as before, and backpropagation is used, averaging over errors at each time point (since an "output," the hidden state, occurs at each step). Also, in the case of RNNs, we run into the problem that the network has a very short memory; it only incorporates information from the most recent unit before the current one and has trouble maintaining long-range context. For applications such as translation, this is clearly a problem, as the interpretation of a word at the end of a sentence may depend on terms near the beginning, not just those directly preceding it.

The LSTM network was developed to allow RNNs to maintain a context or state over long sequences.

In a vanilla RNN, we only maintain a short-term memory h coming from the prior step's hidden unit activations. In addition to this short-term memory, the LSTM architecture introduces an additional layer c, the "long-term" memory, which can persist over many timesteps. The design is in some ways reminiscent of an electrical capacitor, which can use the c layer to store up or hold "charge" and discharge it once it has reached some threshold. To compute these updates, an LSTM unit consists of a number of related neurons, or gates, that act together to transform the input at each time step.

Given an input vector x_t and the hidden state h_(t-1) from the previous time step, at each time step an LSTM first computes a value from 0 to 1 for each element of c, representing what fraction of each element of the vector is "forgotten":
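(written here in the conventional notation, which may differ from the original figure)

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$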

We make a second, similar calculation to determine what part of the input value to preserve:
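(in the same notation)

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$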

Now that we know which elements of c to update, we can compute the update as follows:
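(again in conventional notation, with a tanh layer producing the candidate values)

$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right)$$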

where ∘ denotes the Hadamard product (element-wise multiplication). In essence, this equation tells us how to compute candidate updates using the tanh transform, filter them using the input gate, and combine them with the prior time step's long-term memory using the forget gate to potentially filter out old values.


To compute the output at each time step, we compute another gate, the output gate:
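(with its own weights and bias)

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$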

And to compute the final output at each step (the hidden layer fed as short-term memory to the next step) we have:
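(combining the output gate with the long-term memory)

$$h_t = o_t \circ \tanh(c_t)$$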

Many variants of this basic design have been proposed; for example, the "peephole" LSTM substituted h_(t-1) with c_(t-1) (thus each operation gets to "peep" at the long-term memory cell), while the GRU simplifies the overall design by removing the output gate. What these designs all have in common is that they avoid the vanishing (or exploding) gradient difficulties seen during the training of RNNs, since the long-term memory acts as a buffer to maintain the gradient and propagate neuronal activations over many timesteps.


Networks for sequence data

 In addition to image data, natural language text has also been a frequent topic of interest in neural network research. However, unlike the datasets we've examined thus far, language has a distinct order that is important to its meaning. Thus, to accurately capture the patterns in language or time-dependent data, it is necessary to utilize networks designed for this purpose.

AlexNet architecture

 While the architecture of AlexNet shown in Figure 3.12 might look intimidating, it is not so difficult to understand once we break up this large model into individual processing steps. Let's start with the input images and trace how the output classification is computed for each image through a series of transformations performed by each subsequent layer of the neural network.

The input images to AlexNet are of size 224 * 224 * 3 (for RGB channels). The first layer consists of groups of 96 units and 11 * 11 * 3 kernels; the output is response normalized (as described previously) and max pooled. Max pooling is an operation that takes the maximum value over an n * n grid to register whether a pattern appeared anywhere in the input; this is again a form of positional invariance.

The second layer is also a set of kernels, of size 5 * 5 * 48 in groups of 256. The third through fifth hidden layers have additional convolutions, without normalization, followed by two fully connected layers and an output of size 1,000 representing the possible image classes in ImageNet. The authors of AlexNet used several GPUs to train the model, and this acceleration was important to making a model of this size practical to train.
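A rough sketch of this stack in the tf.keras API (a simplified single-GPU approximation: the cross-GPU grouping and local response normalization are omitted, and the filter counts follow the commonly cited AlexNet configuration) might look as follows:

import tensorflow as tf

# Simplified AlexNet-style stack: convolution + pooling blocks followed by
# fully connected layers and a 1,000-way softmax output.
alexnet_like = tf.keras.Sequential([
    tf.keras.layers.Conv2D(96, (11, 11), strides=4, activation="relu",
                           input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D((3, 3), strides=2),
    tf.keras.layers.Conv2D(256, (5, 5), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D((3, 3), strides=2),
    tf.keras.layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D((3, 3), strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(1000, activation="softmax"),
])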

Looking at the features learned during training in the initial 11 * 11 * 3 convolutions (Figure 3.13), we can see recognizable edges and colors. While the authors of AlexNet don't show examples of neurons higher in the network that synthesize these basic features, an illustration is provided by another study in which researchers trained a large CNN to classify images in YouTube videos, yielding a neuron in the upper reaches of the network that appeared to be a cat detector.

This overview should give you an idea of why CNN architectures look the way they do, and what developments have allowed them to become more tractable as the basis for image classifiers or image-based generative models over time. We will now turn to a second class of more specialized architectures - RNNs - used to develop time- or sequence-based models.


AlexNet and other CNN innovations

 A 2012 article that produced state-of-the-art results classifying the 1.3 million images in ImageNet into 1,000 classes using a model termed AlexNet demonstrates some of the later innovations that made training these kinds of models practical. One, as I've alluded to before, is using ReLU in place of sigmoids or the hyperbolic tangent function. A ReLU is a function of the form:

y = max(0,x)

In contrast to the sigmoid function, or tanh, in which the derivative shrinks to 0 as the function saturates, the ReLU function has a constant gradient and a discontinuity at 0 (Figure 3.10). This means that the gradient does not saturate and cause deeper layers of the network to train more slowly, which would lead to intractable optimization.

While advantageous due to non-vanishing gradients and their low computational requirements (as they are simply thresholded linear transforms), ReLU functions have the downside that they can "turn off" if the input falls below 0, leading again to a 0 gradient. This deficiency was resolved by later work in which a "leak" below 0 was introduced:

y = x if x>0, else 0.01x

A further refinement is to make this threshold adaptive with a slope a, the Parameterized Leaky ReLU (PReLU):

y = max(ax, x) if a <= 1

Another trick used by AlexNet is dropout. The idea of dropout is inspired by ensemble methods, in which we average the predictions of many models to obtain a more robust result. Clearly, for deep neural networks, training many separate models is prohibitive; thus a compromise is to randomly set the values of a subset of neurons to 0 with a probability of 0.5. These values are reset with every forward pass of backpropagation, allowing the network to effectively sample different architectures, since the "dropped out" neurons don't participate in the output in that pass.
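In the tf.keras API, for instance, dropout is simply a layer inserted between the fully connected layers (the surrounding layer sizes here are illustrative):

import tensorflow as tf

# Dropout randomly zeroes half of the activations on each forward pass
# during training; at inference time it is a no-op.
classifier_head = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1000, activation="softmax"),
])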

Yet another enhancement used in AlexNet is local response normalization. Even though ReLUs don't saturate in the same manner as other units, the authors of the model still found value in constraining the range of output. For example, within an individual kernel map, they normalized the response using the values of adjacent kernels, meaning the overall response was rescaled.
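In the formulation from the AlexNet paper, the normalized response b for kernel i at a given position is computed from the raw activations a of the adjacent kernels:

$$b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left(k + \alpha \sum_{j} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}$$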

where a is the unnormalized output of a given kernel at an x, y location on the image, the sum over j runs over adjacent kernels, and β, k, and α are hyperparameters. This rescaling is reminiscent of a later innovation used widely in both convolutional and other neural network architectures: batch normalization. Batch normalization also applies a transformation to "raw" activations within a network:
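In its standard form, this transformation standardizes each activation using the batch mean and variance and then applies a learned rescaling:

$$\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^{2}_{\text{batch}} + \epsilon}}, \qquad y = \gamma\, \hat{x} + \beta$$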

 where x is the unnormalized activation, x̂ is its standardized value, and γ and β are learned scale and shift parameters. This transformation is widely applied in many neural network architectures to accelerate training, though the exact reason why it is effective remains a topic of debate.

Now that you have an idea of some of the methodological advances that made training large CNNs possible, let's examine the structure of AlexNet to see some additional architectural components that we will use in the CNNs we implement in generative models in later chapters.

Early CNNs

 This idea of columns inspired early research into CNN architectures. Instead of learning individual weights between units as in a feedforward network, this architecture uses shared weights within a group of neurons specialized to detect a specific edge in an image. The initial layer of the network (denoted H1) consists of 12 groups of 64 neurons each. Each of these groups is derived by passing a 5 * 5 grid over the 16 * 16-pixel input image; each of the 64 5 * 5 grids in a group shares the same weights, but is tied to a different spatial region of the input. You can see that there must be 64 neurons in each group to cover the input image if their receptive fields overlap by two pixels.

When combined, these 12 groups of neurons in layer H1 form 12 8 * 8 grids representing the presence or absence of a particular edge within a part of the image - each 8 * 8 grid is effectively a down-sampled version of the image (Figure 3.9). This weight sharing makes intuitive sense in that the kernel represented by the weights is specialized to detect a distinct color and/or shape, regardless of where it appears in the image. An effect of this down-sampling is a degree of positional invariance; we only know the edge occurred somewhere within a region of the image, but not its exact location, due to the reduced resolution. Because they are computed by multiplying a 5 * 5 matrix (kernel) with a part of the image, an operation used in image blurring and other transformations, these 5 * 5 input features are known as convolutional kernels, and give the network its name.

Once we have these 12 8 * 8 downsampled versions of the image, the next layer (H2) also has 12 groups of neurons; here, the kernels are 5 * 5 * 8 - they traverse the surface of an 8 * 8 map from H1, across 8 of the 12 groups. We need 16 neurons in each of these 5 * 5 * 8 groups, since a 5 * 5 grid can be moved over four positions across and four positions down an 8 * 8 grid to cover all of its pixels.

Just like deeper cells in the visual cortex, the deeper layers in the network integrate across multiple columns to combine information from different edge detectors together.

Finally, the third hidden layer of this network (H3) contains all-to-all connections between 30 hidden units and the 12 * 16 units in H2, just as in a traditional feedforward network; a final output of 10 units classifies the input image as one of 10 hand-drawn digits.
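A rough tf.keras approximation of this network (treating each group as a convolutional feature map with stride 2 and ignoring the partial connectivity between H1 and H2) would be:

import tensorflow as tf

# Approximation of the early digit-recognition CNN: 16x16 input,
# two 5x5 convolutional layers with stride 2 (12 feature maps each),
# then a 30-unit hidden layer and a 10-way output.
early_cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(12, (5, 5), strides=2, padding="same",
                           activation="tanh",
                           input_shape=(16, 16, 1)),   # H1: 12 maps of 8 x 8
    tf.keras.layers.Conv2D(12, (5, 5), strides=2, padding="same",
                           activation="tanh"),         # H2: 12 maps of 4 x 4
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(30, activation="tanh"),      # H3
    tf.keras.layers.Dense(10, activation="softmax"),
])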

Through weight sharing, the overall number of free parameters in this network is reduced, though it is still large in absolute terms. While backpropagation was used successfully for this task, it required a carefully designed network for a rather limited set of images with a restricted set of outcomes - for real-world applications, such as detecting objects from hundreds or thousands of possible categories, other approaches would be necessary.


Networks for seeing: Convolutional architectures

 As noted at the beginning of this chapter, one of the inspirations for deep neural network models is the biological nervous system. As researchers attempted to design computer vision systems that would mimic the functioning of the visual system, they turned to the architecture of the retina, as revealed by the physiological studies of neurobiologists David Hubel and Torsten Wiesel in the 1960s. As previously described, the physiologist Santiago Ramón y Cajal provided visual evidence that neural structures such as the retina are arranged in vertical networks:

Hubel and Wiesel studied the retinal system in cats, showing how their perception of shapes is composed of the activity of individual cells arranged in a column. Each column of cells is designed to detect a specific orientation of an edge in an input image; images of complex shapes are stitched together from these simpler images.

Varieties of networks: Convolution and recursive

 Up until now we've primarily discussed the basics of neural networks by referencing feedforward networks, where every input is connected to every output in each layer. 

While these feedforward networks are useful for illustrating how deep networks are trained, they are only one class of a broader set of architectures used in modern applications, including generative models. Thus, before covering some of the techniques that make training large networks practical, let's review these alternative deep models.


The shortfalls of backpropagation

 While the backpropagation procedure provides a way to update interior weights within the network in a principled way, it has several shortcomings that make deep networks difficult to use in practice. One is the problem of vanishing gradients. In our derivation of the backpropagation formulas, you saw that the gradients for weights deeper in the network are products of successive partial derivatives from higher layers. In our example, we used the sigmoid function; if we plot the value of the sigmoid and its first derivative, we can see a potential problem:


As the value of the sigmoid function increases or decreases towards the extremes (0 or 1, representing being either "off" or "on"), the values of the gradient vanish to near zero. This means that the updates to w and b, which are products of these gradients from the hidden activation functions y, shrink towards zero, making the weights change little between iterations and making the parameters of the hidden layer neurons change very slowly during backpropagation. Clearly one problem here is that the sigmoid function saturates; thus, choosing another nonlinearity might circumvent this problem (this is indeed one of the solutions later proposed in the form of the ReLU, which we'll cover shortly).
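To see how quickly this happens, note that the derivative of the sigmoid is bounded by 1/4, so a backpropagated gradient passing through n sigmoid units picks up n such factors (in addition to the weights); unless the weights are large, the product shrinks roughly geometrically with depth:

$$\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right) \le \tfrac{1}{4} \;\;\Rightarrow\;\; \prod_{l=1}^{n} \sigma'(x_l) \le \left(\tfrac{1}{4}\right)^{n}$$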

Another problem is more subtle, and has to do with how the network utilizes its available free parameters. As you saw in Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, a posterior probability of a variable can be computed as a product of a likelihood and a prior distribution. We can see deep neural networks as a graphical representation of this kind of probability: the output of the neuron, depending upon its parameters, is a product of all the input values and the distributions on those inputs (the priors). A problem occurs when those values become tightly coupled. As an illustration, consider the competing hypotheses for a headache:


If a patient has cancer, the evidence is so overwhelming that whether they have a cold or not provides no additional value; in essence, the value of the two prior hypotheses becomes coupled because of the influence of one. This makes it intractable to compute the relative contribution of different parameters, particularly in a deep network; we will cover this problem in our discussion of Restricted Boltzmann Machines and Deep Belief Networks in Chapter 4, Teaching Networks to Generate Digits. As we will describe in more detail in that chapter, a 2006 study showed how to counteract this effect, and was one of the first demonstrations of tractable inference in deep neural networks, a breakthrough that relied upon a generative model that produced images of hand-drawn digits.

Beyond these concerns, other challenges to the more widespread adoption of neural networks in the 1990s and early 2000s were the availability of methods such as Support Vector Machines, Gradient and Stochastic Gradient Boosting Models, Random Forests, and even penalized regression methods such as LASSO and Elastic Net, for classification and regression tasks.

While, in theory, deep neural networks had potentially greater representational power than these models, since they build hierarchical representations of the input data through successive layers in contrast to the "shallow" representation given by a single transformation such as a regression weight or decision tree, in practice the challenges of training deep networks made these "shallow" methods more attractive for practical applications. This was coupled with the fact that larger networks required tuning thousands or even millions of parameters, requiring large-scale matrix calculations that were infeasible until the explosion of cheap compute resources - including GPUs and TPUs especially suited to rapid matrix calculations - available from cloud vendors made these experiments practical.

Now that we've covered the basics of training simple network architectures, let's turn to more complex models that will form the building blocks of many of the generative models in the rest of the book: CNNs and sequence models (RNNs, LSTMs, and others).


Backpropagation in practice

 While it is useful to go through this derivation in order to understand how the update rules for a deep neural network are derived, this approach would clearly become unwieldy for large networks and complex architectures. It's fortunate, therefore, that TensorFlow 2 handles the computation of these gradients automatically. During the initialization of the model, each gradient is computed as an intermediate node between tensors and operations in the graph; as an example, see Figure 3.4:


The left side of the preceding figure shows a cost function C computed from the output of a Rectified Linear Unit (ReLU) - a type of neuron function we'll cover later in this chapter - which in turn is computed from multiplying a weight vector by an input x and adding a bias term b. On the right, you can see that this graph has been augmented by TensorFlow to compute all the intermediate gradients required for backpropagation as part of the overall control flow.


After storing these intermediate values, the task of combining them, as shown in the calculation in Figure 3.4, into a complete gradient through recursive operations falls to the GradientTape API. Under the hood, TensorFlow uses a method called reverse-mode automatic differentiation to compute gradients; it holds the dependent variable (the output y) fixed and recursively computes backwards to the beginning of the network the gradients that are required.

For example, let's consider a neural network of the following form:
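Take, as a generic stand-in, a simple composition of functions:

$$y = f(g(h(x)))$$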


If we want to compute the derivative of the output y with respect to an input x we need to repeatedly substitute the outermost expression.
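For the composition above, this repeated substitution is just the chain rule:

$$\frac{dy}{dx} = f'(g(h(x)))\; g'(h(x))\; h'(x)$$

Reverse-mode evaluation computes these factors starting from the output and reuses each intermediate value as it works backwards.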


Thus, to compute the desired gradient we need to just traverse the graph from top to bottom, storing each intermediate gradient as we calculate it. These values are stored on a record, referred to as a tape in reference to early computers in which information was stored on magnetic tape, which is then used to replay the values for calculation. The alternative would be to use forward-mode automatic differentiation, computing from bottom to top. This requires two passes instead of one (one for each branch feeding into the final value), but is conceptually simpler to implement and doesn't require the storage memory of reverse mode. More importantly, though, reverse mode mimics the derivation of backpropagation that I described earlier.

The tape (also known as the Wengert tape, after one of its developers) is actually a data structure that you can access in the TensorFlow Core API. As an example, import the core library:

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf


The tape is then available using the tf.GradientTape() context manager, with which you can evaluate gradients with respect to intermediate values within the graph:


x = tf.ones((2, 2))

with tf.GradientTape() as t:
    t.watch(x)
    y = tf.reduce_sum(x)
    z = tf.multiply(y, y)

# use the tape to compute the derivative of z with respect to the
# intermediate value y
dz_dy = t.gradient(z, y)

# note that the resulting derivative, 2*y = 2*sum(x) = 8
assert dz_dy.numpy() == 8.0

By default, the memory resources used by GradientTape() are released once gradient() is called; however, you can also use the persistent argument to store these results:

x = tf.constant(3.0)

with tf.GradientTape(persistent=True) as t:
    t.watch(x)
    y = x * x
    z = y * y

dz_dx = t.gradient(z, x)  # 108.0 (4*x^3 at x = 3)
dy_dx = t.gradient(y, x)  # 6.0


Now that you've seen how TensorFlow computes gradients in practice to evaluate backpropagation, let's return to the details of how the backpropagation technique evolved over time in response to challenges in practical implementation.



Multi-layer perceptrons and backpropagation

 While large research funding for neural networks declined until the 1980s after the publication of Perceptrons, researchers still recognized that these models had value, particularly when assembled into multi-layer networks, each composed of several perceptron units. Indeed, when the mathematical form of the output function (that is, the output of the model) was relaxed to take on many forms (such as a linear function or a sigmoid), these networks could solve both regression and classification problems, with theoretical results showing that 3-layer networks could effectively approximate any output. However, none of this work addressed the practical limitations of computing the solutions to these models, with rules such as the perceptron learning algorithm described earlier proving a great limitation to their applied use.

Renewed interest in neural networks came with the popularization of the backpropagation algorithm, which, while discovered in the 1960s, was not widely applied to neural networks until the 1980s, following several studies highlighting its usefulness for learning the weights in these models. As you saw with the perceptron model, a learning rule to update weights is relatively easy to derive as long as there are no "hidden" layers. The input is transformed once by the perceptron to compute an output value, meaning the weights can be directly tuned to yield the desired output. When there are hidden layers between the input and output, the problem becomes more complex: how do we change the internal weights that compute the activations feeding into the final output? How do we modify them in relation to the input weights?

The insight of the backpropagation technique is that we can use the chain rule from calculus to efficiently compute the derivative of the loss function with respect to each parameter of the network; combined with a learning rule, this provides a scalable way to train multilayer networks.

Let's illustrate backpropagation with an example: consider a network like the one shown in Figure 3.3. Assume that the output in the final layer is computed using a sigmoidal function, which yields a value between 0 and 1:
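(the standard logistic function, writing x for the summed input to the output neuron)

$$\hat{y} = \sigma(x) = \frac{1}{1 + e^{-x}}$$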


Furthermore, the value x, the sum of the inputs to the final neuron, is a weighted sum of the sigmoidal inputs from the hidden units:
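(writing y_i for the output of hidden unit i and w_i for the corresponding weight)

$$x = \sum_{i} w_i\, y_i + b$$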


We also need a notion of when the network is performing well or badly at its task. A straightforward error function to use here is squared loss:
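(in a standard form, with the indexing conventions described in the next paragraph)

$$E = \frac{1}{2} \sum_{j=1}^{J} \sum_{k=1}^{K} \left(\hat{y}_{jk} - y_{jk}\right)^{2}$$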


where yhat is the estimated value (from the output of the model) and y is the real value, summed over all the input examples J and the outputs of the network K (where K=1, since there is only a single output value). Backpropagation begins with a "forward pass" where we compute the values of all the outputs in the inner and outer layers, to obtain the estimated values yhat. We then proceed with a backward step to compute gradients to update the weights.

Our overall objective is to compute partial derivatives for the weights w and bias terms b in each neuron, ∂E/∂w and ∂E/∂b, which will allow us to compute the updates for b and w. Towards this goal, let's start by computing the update rule for the inputs to the final neuron; we want to calculate the partial derivative of the error E with respect to each of these inputs (in this example there are five, corresponding to the five hidden layer neurons), using the chain rule:
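Working through the summed input x to the output neuron first (the gradients for the individual inputs and weights then follow from it below):

$$\frac{\partial E}{\partial x} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial x}$$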

The first factor, for an individual example, is just the difference between the estimated and the observed value, ∂E/∂yhat = (yhat - y). For the second factor, we need to take the partial derivative of the sigmoid function:
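(the standard derivative of the logistic function)

$$\frac{\partial \hat{y}}{\partial x} = \hat{y}\left(1 - \hat{y}\right)$$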


Putting it all together, we have:
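(dropping the example index for a single sample)

$$\frac{\partial E}{\partial x} = \left(\hat{y} - y\right)\hat{y}\left(1 - \hat{y}\right)$$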


If we want to compute the gradient for a particular parameter of x, such as a weight w or bias term b, we need one more step:
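(applying the chain rule once more through x)

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial x} \cdot \frac{\partial x}{\partial w}$$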


We already know the first term, and x depends on w only through the inputs y from the lower layer (since x is a linear function of them), so we obtain:
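(using the weighted-sum form of x above)

$$\frac{\partial x}{\partial w_i} = y_i \quad\Rightarrow\quad \frac{\partial E}{\partial w_i} = \left(\hat{y} - y\right)\hat{y}\left(1 - \hat{y}\right) y_i$$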


If  we want to compute this derivative for one of the neurons in the hidden layer, we likewise take the partial derivative with respect to this input y, which is simply:
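(the weight connecting that hidden unit to the output)

$$\frac{\partial x}{\partial y_i} = w_i$$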


So, in total, we can sum over all the units that this hidden unit feeds into:
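(here there is only the single output unit, so the sum has one term)

$$\frac{\partial E}{\partial y_i} = \sum_{k} \frac{\partial E}{\partial x_k}\, w_{ik}$$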


We can repeat this process recursively for any units in deeper layers to obtain the desired update rule, since we now know how to calculate the gradients for y or w at any layer. This makes the process of updating weights efficient, since once we have computed the gradients through the backward pass we can combine consecutive gradients through the layers to get the required gradient at any depth of the network.


Now that we have the gradients for each w (or any other parameter of the neuron we might want to calculate), how can we make a "learning rule" to update the weights? In their paper, Hinton et al. noted that we could apply an update to the model parameters after computing gradients on each sample batch, but suggested instead applying an update calculated after averaging over all samples. The gradient represents the direction in which the error function is changing with the greatest magnitude with respect to the parameters; thus, to update, we want to push the weight in the opposite direction, with Δw the update and ε a small value (a step size):
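(the basic gradient-descent step)

$$\Delta w = -\varepsilon\, \frac{\partial E}{\partial w}$$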


Then at each time t during training we update the weight using this calculated gradient:
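A common form of this update (the momentum formulation used in the original backpropagation paper) is:

$$w(t) = w(t-1) + \Delta w(t), \qquad \Delta w(t) = -\varepsilon\, \frac{\partial E}{\partial w}(t) + \alpha\, \Delta w(t-1)$$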


where α is a decay parameter, ranging from 0 to 1, that weights the contribution of prior updates. Following this procedure, we would initialize the weights in the network with some small random values, choose a step size ε, and iterate with forward and backward passes, along with updates to the parameters, until the loss function reaches some desired value.

Now that we have described the formal mathematics behind backpropagation, let us look at how it is implemented in practice in software packages such as TensorFlow 2.



From TLUs to tuning perceptrons

 Besides these limitations for representing the XOR and XNOR operations, there are additional simplifications that cap the representational power of the TLU model; the weights are fixed, and the output can only be binary (0 or 1). Clearly, for a system such as a neuron to "learn," it needs to respond to the environment and determine the relevance of different inputs based on feedback from prior experiences. This idea was captured in the 1949 book Organization of Behavior by Canadian psychologist Donald Hebb, who proposed that the activity of nearby neuronal cells would tend to synchronize over time, sometimes paraphrased as Hebb's Law: neurons that fire together wire together. Building on Hebb's proposal that weights change over time, researcher Frank Rosenblatt of the Cornell Aeronautical Laboratory proposed the perceptron model in the 1950s. He replaced the fixed weights in the TLU model with adaptive weights and added a bias term, giving a new function:
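(in a standard form, with the threshold absorbed into the bias term b)

$$\hat{y} = \begin{cases} 1 & \text{if } \sum_i w_i x_i + b > 0 \\ 0 & \text{otherwise} \end{cases}$$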

We note that the inputs I have been denoted X to underscore the fact that they could be any value, not just binary 0 or 1. Combining Hebb's observations with the TLU model, the weights of the perceptron would be updated according to a simple learning rule:

1. Start with a set of J samples x(1)...x(J). These samples all have a label y which is 0 or 1, giving labeled data (y, x)(1)...(y, x)(J). These samples could have either a single value, in which case the perceptron has a single input, or be a vector of length N with indices i for multi-value input.

2. Initialize all weights w to a small random value or 0.

3. Compute the estimated value, yhat, for all the examples x using the perceptron function.

4. Update the weights using a learning rate r to more closely match the input to the desired output for each step t in training:

wi(t+1) = wi(t) + r(yi - yhati)xji, for all J samples and N features.

Conceptually, note that if the estimate yhat is 0 and the target is 1, we want to increase the value of the weight by some increment r; likewise, if the target is 0 and the estimate is 1, we want to decrease the weight so the inputs do not exceed the threshold.

5. Repeat steps 3-4 until the difference between the predicted and actual outputs, y and yhat, falls below some desired threshold. In the case of a non-zero bias term, b, an update can be computed as well using a similar formula. A minimal code sketch of this procedure is shown below.
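The following minimal NumPy sketch illustrates the procedure above (the learning rate, epoch count, and the AND example are illustrative choices, not taken from the original text):

import numpy as np

def train_perceptron(X, y, r=0.1, epochs=20):
    """Minimal perceptron learning rule: a threshold unit with a bias term."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            y_hat = 1.0 if np.dot(w, xi) + b > 0 else 0.0
            # Update weights and bias in proportion to the error.
            w += r * (target - y_hat) * xi
            b += r * (target - y_hat)
    return w, b

# Example: the perceptron can learn AND, but (as discussed below) not XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1], dtype=float)
w, b = train_perceptron(X, y_and)
print([1.0 if np.dot(w, xi) + b > 0 else 0.0 for xi in X])  # [0.0, 0.0, 0.0, 1.0]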


 While simple, you can appreciate that many patterns could be learned from such a classifier, though still not the XOR function. However, by combining several perceptrons into multiple layers, these units could represent any simple Boolean function, and indeed McCulloch and Pitts had previously speculated on combining such simple units into a universal computation engine, or Turing Machine, that could represent any operation in a standard programming language. However, the preceding learning algorithm operates on each unit independently, meaning it could be extended to networks composed of many layers of perceptrons.


However, the 1969 book Perceptrons, by MIT computer scientists Marvin Minsky and Seymour Papert, demonstrated that a three-layer feed-forward network required complete (non-zero weight) connections between at least one of these units (in the first layer) and all inputs to compute all possible logical outputs. This meant that instead of having a very sparse structure, like biological neurons, which are only connected to a few of their neighbors, these computational models required very dense connections.

While connective sparsity has been incorporated in later architectures, such as CNNs, such dense connections remain a feature of many models too, particularly in the fully connected layers that often form the second-to-last hidden layers in models. In addition to these models being computationally unwieldy on the hardware of the day, the observation that sparse models could not compute all logical operations was interpreted more broadly by the research community as Perceptrons cannot compute XOR. While erroneous, this message led to a drought in funding for AI in subsequent years, a period sometimes referred to as the AI Winter.

The next revolution in neural network research would require a more efficient way to compute the required parameter updates in complex models, a technique that would become known as backpropagation.



From tissues to TLUs

 The recent popularity of AI algorithms might give the false impression that this field is new. Many recent models are based on discoveries made decades ago that have been reinvigorated by the massive computational resources available in the cloud and by customized hardware for parallel matrix computations, such as Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), and Field Programmable Gate Arrays (FPGAs). If we consider research on neural networks to include their biological inspiration as well as computational theory, this field is over a hundred years old. Indeed, one of the first neural networks described appears in the detailed anatomical illustrations of 19th-century scientist Santiago Ramón y Cajal, whose illustrations, based on experimental observation of layers of interconnected neuronal cells, inspired the Neuron Doctrine - the idea that the brain is composed of individual, physically distinct, and specialized cells, rather than a single continuous network. The distinct layers of the retina observed by Cajal were also the inspiration for particular neural network architectures, such as the CNN, which we will discuss later in this chapter.

This observation of simple neuronal cells interconnected in large networks led computational researchers to hypothesize how mental activity might be represented by simple, logical operations that, combined, yield complex mental phenomena. The original "automata theory" is usually traced to a 1943 article by Warren McCulloch and Walter Pitts of the Massachusetts Institute of Technology. They described a simple model known as the Threshold Logic Unit (TLU), in which binary inputs are translated into a binary output based on a threshold:
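(keeping the notation of the following sentence)

$$y = f\left(\sum_i W_i\, I_i\right)$$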
where I is the input values, W is the weights with values ranging over (0, 1) or (-1, 1), and f is a threshold function that converts these inputs into a binary output depending upon whether they exceed a threshold T:

f(x) = 1 if x > T, else 0

Visually and conceptually, there is some similarity between McCulloch and Pitts' model and the biological neuron that inspired it. Their model integrates inputs into an output signal, just as the natural dendrites (short, input "arms" of the neuron that receive signals from other cells) of a neuron synthesize inputs into a single output via the axon (the long "tail" of the cell, which passes signals received from the dendrites along to other neurons). We might imagine that, just as neuronal cells are composed into networks to yield complex biological circuits, these simple units might be connected to simulate sophisticated decision processes.

Indeed, using this simple model, we can already start to represent several logical operations. If we consider a simple case of a neuron with one input, we can see that a TLU can solve an identity or negation function.

For an identity operation that simply returns the input as output, the weight matrix would have 1s on the diagonal (or be simply the scalar 1, for a single numerical input, as illustrated in Table 1):


Similarly, for a negation operation, the weight matrix could be a negative identity matrix, with a threshold at 0 flipping the sign of the output from the input:


Given two inputs, a TLU could also represent operations such as AND and OR.

Here, a threshold could be set such that the combined input values have to reach 2 (to yield an output of 1) for an AND operation, or 1 (to yield an output of 1 if either of the two inputs is 1) for an OR operation.
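As a tiny illustration in code (implementing the threshold test as "greater than or equal to" so that sums of exactly 2 or 1 fire the unit):

def tlu(inputs, weights, threshold):
    """Threshold logic unit: fire (1) if the weighted sum reaches the threshold."""
    total = sum(w * i for w, i in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Two binary inputs with unit weights: threshold 2 gives AND, threshold 1 gives OR.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", tlu((a, b), (1, 1), 2), "OR:", tlu((a, b), (1, 1), 1))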

However, a TLU cannot capture patterns such as Exclusive OR (XOR), which emits 1 if and only if exactly one of the two inputs is 1.


To see why this is true, consider a TLU with two inputs and positive weights of 1 for each unit. If the threshold value T is 1, then inputs of (0,0), (1,0), and (0,1) will yield the correct value. What happens with (1,1), though? Because the threshold function returns 1 for any inputs summing to greater than 1, it cannot represent XOR (Table 3.5), which would require a second threshold to compute a different output once a different, higher value is exceeded. Changing one or both of the weights to negative values won't help either; the problem is that the decision threshold operates only in one direction and can't be reversed for larger inputs.

Similarly, the TLU can't represent the negation of XOR, the Exclusive NOR (XNOR). As with the XOR operation, the impossibility of the XNOR operation being represented by a TLU function can be illustrated by considering a weight matrix of two 1s; for the two inputs (1,0) and (0,1), we obtain the correct value if we set a threshold of 2 for outputting 1. As with the XOR operation, we run into a problem with an input of (0,0), as we can't set a second threshold to output 1 at a sum of 0.






Perceptrons - a brain in a function

 The simplest neural network architecture - the perceptron - was inspired by biological research to understand the basis of mental processing in an attempt to represent the function of the brain with mathematical formulae. In this section we will cover some of this early research and how it inspired what is now the field of deep learning and generative AI.


3. Building Blocks of Deep Neural Networks

 The wide range of generative AI models that we will implement in this book are all built on the foundation of advances over the last decade in deep learning and neural networks. While in practice we could implement these projects without reference to historical developments, retracing their underlying components will give you a richer understanding of how and why these models work. In this chapter, we will dive into this background, showing you how generative AI models are built from the ground up, how smaller units are assembled into complex architectures, how the loss functions in these models are optimized, and some current theories as to why these models are so effective. Armed with this background knowledge, you should be able to understand in greater depth the reasoning behind the more advanced models and topics that start in Chapter 4, Teaching Networks to Generate Digits, of this book. Generally speaking, we can group the building blocks of neural network models into a number of choices regarding how the model is constructed and trained, which we will cover in this chapter:


Which neural network architecture to use:

- Perceptron

- Multilayer perceptron (MLP)/feedforward

- Convolutional Neural Networks (CNNs)

- Recurrent Neural Networks (RNNs)

- Long Short-Term Memory Networks (LSTMs)

- Gated Recurrent Units (GRUs)


Which activation functions to use in the network:

- Linear

- Sigmoid

- Tanh

- ReLU

- PReLU


What optimization algorithm to use to tune the parameters of the network:

- Stochastic Gradient Descent (SGD)

- RMSProp

- AdaGrad

- ADAM

- AdaDelta

- Hessian-free optimization


How to initialize the parameters of the network:

- Random

- Xavier initialization

- He initialization

As you can appreciate, the products of these decisions can lead to a huge number of potential neural network variants, and one of the challenges of developing these models is determining the right search space within each of these choices. In the course of describing the history of neural networks, we will discuss the implications of each of these model parameters in more detail. Our overview of this field begins with the origin of the discipline: the humble perceptron model.


Summary

 In this chapter, we have covered an overview of what TensorFlow is and how it serves as an improvement over earlier frameworks for deep learning research.

We also explored setting up an IDE, VSCode, and the foundation of reproducible applications, Docker containers. To orchestrate and deploy Docker containers, we discussed the Kubernetes framework, and how we can scale groups of containers using its API. Finally, I described Kubeflow, a machine learning framework built on Kubernetes which allows us to run end-to-end pipelines, distributed training, and parameter search, and serve trained models. We then set up a Kubeflow deployment using Terraform, an IaaS technology.

Before jumping into specific projects, we will next cover the basics of neural network theory and the TensorFlow and Keras commands that you will need to write basic training jobs on Kubeflow.


Using Kubeflow Katib to optimize model hyperparameters

 Katib is a framework for running multiple instances of the same job with differing inputs, such as in neural architecture search (for determining the right number and size of layers in a neural network) and hyperparameter search (finding the right learning rate, for example, for an algorithm). Like the other Kustomize templates we have seen, the example below specifies a generic TensorFlow job, with placeholders for the parameters:


    apiVersion: "kubeflow.org/v1alpha3"
    kind: Experiment
    metadata:
        namespace: kubeflow
        name: tfjob-example
    spec:
        parallelTrialCount: 3
        maxTrialCount: 12
        maxFailedTrialCount: 3
        objective:
            type: maximize
            goal: 0.99
            objectiveMetricName: accuracy_1
        algorithm:
            algorithmName: random
        metricsCollectorSpec:
            source:
                fileSystemPath:
                    path: /train
                    kind: Directory
            collector:
                kind: TensorFlowEvent
        parameters:
            - name: --learning_rate
              parameterType: double
              feasibleSpace:
                  min: "0.01"
                  max: "0.05"
            - name: --batch_size
              parameterType: int
              feasibleSpace:
                  min: "100"
                  max: "200"
        trialTemplate:
            goTemplate:
                rawTemplate: |-
                    apiVersion: "kubeflow.org/v1"
                    kind: TFJob
                    metadata:
                        name: {{.Trial}}
                        namespace: {{.NameSpace}}
                    spec:
                        tfReplicaSpecs:
                            Worker:
                                replicas: 1
                                restartPolicy: OnFailure
                                template:
                                    spec:
                                        containers:
                                            - name: tensorflow
                                              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                                              imagePullPolicy: Always
                                              command:
                                                  - "python"
                                                  - "/var/tf_mnist/mnist_with_summaries.py"
                                                  - "--log_dir=/train/metrics"
                                                  {{- with .HyperParameters}}
                                                  {{- range .}}
                                                  - "{{.Name}}={{.Value}}"
                                                  {{- end}}
                                                  {{- end}}

Which we can run using the familiar kubectl syntax:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml


or through the UI, where you can see a visual of the outcome of these multi-parameter experiments, or a table.



Kubeflow pipelines

 For notebook servers, we gave an example of a single-container (the notebook instance) application. Kubeflow also gives us the ability to run multi-container application workflows (such as input data, training, and deployment) using the pipelines functionality. Pipelines are Python functions that follow a Domain Specific Language (DSL) to specify components that will be compiled into containers.

If we click Pipelines on the UI, we are brought to a dashboard.

Selecting one of these pipelines, we can see a visual overview of the component containers.


After creating a new run, we can specify parameters for a particular instance of this pipeline.

Once the pipeline is created, we can use the user interface to visualize the results.

Under the hood, the Python code to generate this pipeline is compiled using the pipelines SDK. We could specify the components to come either from a container with Python code:


@kfp.dsl.component
def my_component(my_param):
    ...
    return kfp.dsl.ContainerOp(
        name='My component name',
        image='gcr.io/path/to/container/image'
    )

or a function written in Python itself:

@kfp.dsl.python_component(
    name='My awesome component',
    description='Come and play',
)
def my_python_func(a: str, b: str) -> str:
    ...


For a pure Python function, we could turn this into an operation with the compiler:

my_op = compiler.build_python_component(
    component_func=my_python_func,
    staging_gcs_path=OUTPUT_DIR,
    target_image=TARGET_IMAGE)


We then use the dsl.pipeline decorator to add this operation to a pipeline:

    @kfp.dsl.pipeline(
        name='My pipeline',
        description='My machine learning pipeline'
    )
    def my_pipeline(param_1: PipelineParam, param_2: PipelineParam):
        my_step = my_op(a='a', b='b')


We compile it using the following code:

    kfp.compiler.Compiler().compile(my_pipeline, 'my-pipeline.zip')

and run it with this code:

    client = kfp.Client()
    my_experiment = client.create_experiment(name='demo')
    my_run = client.run_pipeline(my_experiment.id, 'my-pipeline', 'my-pipeline.zip')

We can also upload this ZIP file to the pipelines UI, where Kubeflow can use the YAML generated from compilation to instantiate the job.

Now that you have seen the process for generating results for a single pipeline, our next problem is how to generate the optimal parameters for such a pipeline. As you will see in Chapter 3, Building Blocks of Deep Neural Networks, neural network models typically have a number of configuration choices regarding their architecture (such as the number of layers, layer size, and connectivity) and training paradigm (such as learning rate and optimizer algorithm). Kubeflow has a built-in utility for optimizing models over such parameter grids, called Katib.

Kubeflow notebook servers

 We can use Kubeflow to start a Jupyter notebook server in a namespace, where we can run experimental code; we can start the notebook by clicking the Notebook Servers tab in the user interface and selecting NEW SERVER.

We can then specify parameters, such as which container to run (which could include the TensorFlow container we examined earlier in our discussion of Docker), and how many resources to allocate.


You can also specify a Persistent Volume (PV) to store data that remains even if the notebook server is turned off, and special resources such as GPUs.

Once started, if you have specified a container with TensorFlow resources, you can begin running models in the notebook server.

A brief tour of Kubeflow's components

 Now that we have installed Kubeflow locally or in the cloud, let us take a look again at the Kubeflow dashboard.

Let's walk through what is available in this toolkit. First, notice in the upper panel we have a dropdown with the name anonymous specified - this is the Kubernetes namespace referred to earlier. While our default is anonymous, we could create several namespaces on our Kubeflow instance to accommodate different users or projects. This can be done at login, where we set up a profile.

Alternatively, as with other operations in Kubernetes, we can apply a namespace using a YAML file:

apiVersion: kubeflow.org/v1beta1

kind: Profile

metadata:

    name: profileName

spec:

    owner:

        kind: User

        name: userid@email.com

Using the kubectl command:

kubectl create -f profile.yaml

What can we do once we have a namespace? Let us look through the available tools.

Installing Kubeflow using Terraform

 For each of these cloud providers, you'll probably notice that we have a common set of commands: creating a Kubernetes cluster, installing Kubeflow, and starting the application. While we can use scripts to automate this process, it would be desirable to, like our code, have a way to version control and persist different infrastructure configurations, allowing a reproducible recipe for creating the set of resources we need to run Kubeflow. It would also help us potentially move between cloud providers without completely rewriting our installation logic.

The template language Terraform (https://www.terraform.io/) was created by HashiCorp as a tool for Infrastructure as a Service (IaaS). In the same way that Kubernetes has an API to update resources on a cluster, Terraform allows us to abstract interactions with different underlying cloud providers through an API and a template language, using a command-line utility and core components written in GoLang (Figure 2.7). Terraform can be extended using user-written plugins.

(Figure 2.7: Terraform Core communicates over RPC with plugins - providers and provisioners - which use their client libraries to call the upstream cloud APIs.)

Let's look at one example of installing Kubeflow using Terraform instructions on AWS, located at https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow. Once you have established the required AWS resources and installed Terraform on an EC2 instance, the aws-eks-cluster-and-nodegroup.tf Terraform file is used to create the Kubeflow cluster using the command:

terraform apply

In this file are a few key components. One is variables that specify aspects of the deployment:

variable "efs_throughput_mode" {

    description = "EFS performance mode"

    default = "burstring"

    type = string

}

Another is specification for which cloud provider we are using:

provider "aws" {

    region    =    var.region

    shared_credentials_file    = var.credentials 

    resrouce "aws_eks_cluster"    "eks_cluster" {

        name    =    var.cluster_name

        role_arn    =    aws_iam_role.cluster.role.arn

        version     =    var.k8s_version


    vpc_config {

        security_group_ids    =    [aws_security_group.cluster_sg.id]

        subnet_ids    =    flatten([aws_subnet.subnet.*.id])

    }

    depends_on    = [

        aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy,

        aws_iam_role_policy_attachment.cluster_AmazonKSServicePolicy,

    ]

    provisioner    "local-exec"       {

        command    =    "aws --region ${var.region} eks update-kubeconfig --name ${aws_eks_cluster.eks_cluster.name}"

    }

    provisioner    "local-exec"    {

        when    =    destroy

        command    =    "kubectl config unset current-context"

    }

}

    profile    =    var.profile

}

And another is resources such as the EKS cluster:

resource "aws_eks_cluster" "eks_cluster" {
    name     = var.cluster_name
    role_arn = aws_iam_role.cluster_role.arn
    version  = var.k8s_version

    vpc_config {
        security_group_ids = [aws_security_group.cluster_sg.id]
        subnet_ids         = flatten([aws_subnet.subnet.*.id])
    }

    depends_on = [
        aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy,
        aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy,
    ]

    provisioner "local-exec" {
        command = "aws --region ${var.region} eks update-kubeconfig --name ${aws_eks_cluster.eks_cluster.name}"
    }

    provisioner "local-exec" {
        when    = destroy
        command = "kubectl config unset current-context"
    }
}

Every time you run the terraform apply command, it walks through this file to determine what resources to create, which underlying AWS services to call to create them, and which configuration they should be provisioned with. This provides a clean way to orchestrate complex installations such as Kubeflow in a versioned, extensible template language.

Now that we have successfully installed Kubeflow either locally or on a managed Kubernetes control plane in the cloud, let us take a look at what tools are available on the platform.