
From TLUs to tuning perceptrons

 Besides these limitations in representing the XOR and XNOR operations, there are additional simplifications that cap the representational power of the TLU model: the weights are fixed, and the output can only be binary (0 or 1). Clearly, for a system such as a neuron to "learn," it needs to respond to the environment and determine the relevance of different inputs based on feedback from prior experiences. This idea was captured in the 1949 book The Organization of Behavior by Canadian psychologist Donald Hebb, who proposed that the activity of nearby neuronal cells would tend to synchronize over time, sometimes paraphrased as Hebb's Law: neurons that fire together wire together. Building on Hebb's proposal that weights changed over time, researcher Frank Rosenblatt of the Cornell Aeronautical Laboratory proposed the perceptron model in the 1950s. He replaced the fixed weights in the TLU model with adaptive weights and added a bias term, giving a new function:

yhat = f(sum(wi * xi) + b)
We note that the inputs, previously denoted I, are now denoted x to underscore the fact that they could be any value, not just binary 0 or 1. Combining Hebb's observations with the TLU model, the weights of the perceptron are updated according to a simple learning rule:

1. Start with a set of J samples x(1), ..., x(J). These samples all have a label y which is 0 or 1, giving labeled data (y, x)(1), ..., (y, x)(J). These samples could have either a single value, in which case the perceptron has a single input, or be a vector of length N with indices i for multi-value input.

2. Initialize all weights w to a small random value or 0.

3. Compute the estimated value, yhat, for all the examples x using the perceptron function.

4. Update the weights using a learning rate r to more closely match the input to the desired output for each step t in training:

wi(t+1) = wi(t) + r(yj - yhatj)xji, for all J samples and N features.

Conceptually, note that if the estimate yhat is 0 and the target y is 1, we want to increase the value of the weight by some increment r; likewise, if the target is 0 and the estimate is 1, we want to decrease the weight so the inputs do not exceed the threshold.

5. Repeat steps 3-4 until the difference between the predicted and actual outputs, yhat and y, falls below some desired threshold. In the case of a nonzero bias term, b, an update can be computed as well using a similar formula.
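To make steps 1-5 concrete, here is a minimal sketch of the learning rule in Python using NumPy; the AND dataset, the learning rate of 0.1, and the 20-epoch cap are illustrative choices of mine, not part of the original description:

    import numpy as np

    def perceptron_output(x, w, b):
        # Perceptron: threshold the weighted input sum plus bias into 0 or 1
        return 1 if np.dot(w, x) + b > 0 else 0

    # Step 1: labeled samples (y, x) - here the AND function as an example dataset
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])

    r = 0.1                   # learning rate
    w = np.zeros(X.shape[1])  # step 2: initialize the weights (and bias) to 0
    b = 0.0

    for epoch in range(20):
        errors = 0
        for x_j, y_j in zip(X, y):
            y_hat = perceptron_output(x_j, w, b)  # step 3: estimate yhat
            w += r * (y_j - y_hat) * x_j          # step 4: update the weights
            b += r * (y_j - y_hat)                # bias updated by a similar rule
            errors += int(y_hat != y_j)
        if errors == 0:                           # step 5: stop once all correct
            break

    print(w, b)  # learned weights and bias reproducing AND

On linearly separable data such as AND, this loop converges after a handful of epochs; on data that is not linearly separable, such as XOR, it never would.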


 While simple, you can appreciate that many patterns could be learned from such a classifier, though still not the XOR function. However, by combining several perceptrons into multiple layers, these units could represent any simple Boolean function, and indeed McCulloch and Pitts had previously speculated on combining such simple units into a universal computation engine, or Turing Machine, that could represent any operation in a standard programming language. However, the preceding learning algorithm operates on each unit independently, meaning it could be extended to networks composed of many layers of perceptrons.


However, the 1969 book Perceptrons, by MIT computer scientists Marvin Minsky and Seymour Papert, demonstrated that a three-layer feed-forward network required complete (non-zero weight) connections between at least one of these units (in the first layer) and all inputs to compute all possible logical outputs. This meant that instead of having a very sparse structure, like biological neurons, which are only connected to a few of their neighbors, these computational models required very dense connections.

While connective sparsity has been incorporated in later architectures, such as CNNs, dense connections remain a feature of many models too, particularly in the fully connected layers that often form the second-to-last hidden layers in models. In addition to these models being computationally unwieldy on the hardware of the day, the observation that sparse models could not compute all logical operations was interpreted more broadly by the research community as "Perceptrons cannot compute XOR." While erroneous, this message led to a drought in funding for AI in subsequent years, a period sometimes referred to as the AI Winter.

The next revolution in neural network research would require a more efficient way to compute the required parameter updates in complex models, a technique that would become known as backpropagation.



From tissues to TLUs

 The recent popularity of AI algorithms might give the false impression that this field is new. Many recent models are based on discoveries made decades ago that have been reinvigorated by the massive computational resources available in the cloud and customized hardware for parallel matrix computations, such as Graphical Processing Units (GPUs), Tensor Processing Units (TPUs), and Field Programmable Gate Arrays (FPGAs). If we consider research on neural networks to include their biological inspiration as well as computational theory, this field is over a hundred years old. Indeed, one of the first neural networks described appears in the detailed anatomical illustrations of 19th-century scientist Santiago Ramón y Cajal, whose illustrations based on experimental observation of layers of interconnected neuronal cells inspired the Neuron Doctrine - the idea that the brain is composed of individual, physically distinct and specialized cells, rather than a single continuous network. The distinct layers of the retina observed by Cajal were also the inspiration for particular neural network architectures such as the CNN, which we will discuss later in this chapter.

This observation of simple neuronal cells interconnected in large networks led computational researchers to hypothesize how mental activity might be represented by simple, logical operations that, combined, yield complex mental phenomena. The original "automata theory" is usually traced to a 1943 article by Warren McCulloch and Walter Pitts of the Massachusetts Institute of Technology. They described a simple model known as the Threshold Logic Unit (TLU), in which binary inputs are translated into a binary output based on a threshold:

yhat = f(sum(Ii * Wi))

where I is the input values, W is the weights with ranges from (0, 1) or (-1, 1), and f is a threshold function that converts these inputs into a binary output depending upon whether they exceed a threshold T:

f(x) = 1 if x > T, else 0

Visually and conceptually, there is some similarity between the McCulloch and Pitts model and the biological neuron that inspired it. Their model integrates inputs into an output signal, just as the natural dendrites (short, input "arms" of the neuron that receive signals from other cells) of a neuron synthesize inputs into a single output via the axon (the long "tail" of the cell, which passes signals received from the dendrites along to other neurons). We might imagine that, just as neuronal cells are composed into networks to yield complex biological circuits, these simple units might be connected to simulate sophisticated decision processes.

Indeed, using this simple model, we can already start to represent several logical operations. If we consider a simple case of a neuron with one input, we can see that a TLU can solve an identity or negation function.

For an identity operation that simply returns the input as output, the weight matrix would have 1s on the diagonal (or simply be the scalar 1, for a single numerical input, as illustrated in Table 1):


Similarly, for a negation operation, the weight matrix could be a negative identity matrix, with a threshold just below 0 flipping the output relative to the input:
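As a minimal sketch of these two single-input cases (the function name tlu and the exact threshold values are my own illustrative choices, using the strict inequality f(x) = 1 if x > T from above):

    def tlu(inputs, weights, threshold):
        # Threshold Logic Unit: output 1 if the weighted input sum exceeds T, else 0
        total = sum(i * w for i, w in zip(inputs, weights))
        return 1 if total > threshold else 0

    # Identity: weight 1 and a threshold just below 1, so the output mirrors the input
    print(tlu([0], [1], 0.5), tlu([1], [1], 0.5))      # 0 1

    # Negation: weight -1 and a threshold just below 0, flipping the input
    print(tlu([0], [-1], -0.5), tlu([1], [-1], -0.5))  # 1 0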


Given two inputs, a TLU could also represent operations such as AND and OR.

Here, a threshold could be set such that the combined input values either have to reach 2 (to yield an output of 1) for an AND operation, or 1 (to yield an output of 1 if either of the two inputs is 1) for an OR operation; with the strict inequality above, this corresponds to thresholds of T = 1 and T = 0, respectively.
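Continuing the sketch (the tlu helper is repeated here so the snippet runs on its own), unit weights with thresholds of 1 and 0 reproduce AND and OR:

    def tlu(inputs, weights, threshold):
        total = sum(i * w for i, w in zip(inputs, weights))
        return 1 if total > threshold else 0

    pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]

    # AND: with weights (1, 1), only the input (1, 1) sums above a threshold of 1
    print([tlu(p, [1, 1], 1) for p in pairs])  # [0, 0, 0, 1]

    # OR: any single 1 pushes the sum above a threshold of 0
    print([tlu(p, [1, 1], 0) for p in pairs])  # [0, 1, 1, 1]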

However, a TLU cannot capture patterns such as the Exclusive OR (XOR), which emits 1 if and only if exactly one of its two inputs is 1.


To see why this is true, consider a TLU with two inputs and positive weights of 1 for each unit. If the threshold T is set just below 1, so that any inputs summing to 1 or more produce an output of 1, then inputs of (0,0), (1,0), and (0,1) will yield the correct value. What happens with (1,1), though? Because the threshold function returns 1 for any inputs summing to 1 or greater, it cannot represent XOR (Table 3.5), which would require a second threshold to compute a different output once a different, higher value is exceeded. Changing one or both of the weights to negative values won't help either; the problem is that the decision threshold operates only in one direction and can't be reversed for larger inputs.

Similarly, the TLU can't represent the negation of XOR, the Exclusive NOR (XNOR). As with the XOR operation, the impossibility of representing XNOR with a TLU can be illustrated by considering a weight matrix of two 1s; for the inputs (1,0) and (0,1), we obtain the correct value (0) if we set a threshold of 2 for outputting 1. As with the XOR operation, we run into a problem with an input of (0,0), as we can't set a second threshold to output 1 at a sum of 0.
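A brute-force check makes this argument concrete. The following small script (an illustration of mine, not from the original text) tries every qualitatively distinct threshold for unit weights and confirms that neither XOR nor XNOR can be matched:

    def tlu(inputs, weights, threshold):
        total = sum(i * w for i, w in zip(inputs, weights))
        return 1 if total > threshold else 0

    pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    targets = {"XOR": [0, 1, 1, 0], "XNOR": [1, 0, 0, 1]}

    # With weights (1, 1) the input sums are 0, 1, 1, 2, so thresholds below 0,
    # between 0 and 1, between 1 and 2, and at or above 2 cover every distinct case.
    for name, expected in targets.items():
        solvable = any(
            [tlu(p, [1, 1], t) for p in pairs] == expected
            for t in (-0.5, 0.5, 1.5, 2.5)
        )
        print(name, "solvable with a single threshold:", solvable)  # both False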






Perceptrons - a brain in a function

 The simplest neural network architecture - the perceptron - was inspired by biological research into the basis of mental processing, as part of an attempt to represent the function of the brain with mathematical formulae. In this section, we will cover some of this early research and how it inspired what is now the field of deep learning and generative AI.


3. Building Blocks of Deep Neural Networks

 The wide range of generative AI models that we will implement in this book are all built on the foundation of advances over the last decade in deep learning and neural networks. While in practice we could implement these projects without reference to historical developments, retracing their underlying components will give you a richer understanding of how and why these models work. In this chapter, we will dive into this background, showing you how generative AI models are built from the ground up, how smaller units are assembled into complex architectures, how the loss functions in these models are optimized, and some current theories as to why these models are so effective. Armed with this background knowledge, you should be able to understand in greater depth the reasoning behind the more advanced models and topics that start in Chapter 4, Teaching Networks to Generate Digits, of this book. Generally speaking, we can group the building blocks of neural network models into a number of choices regarding how the model is constructed and trained, which we will cover in this chapter:


Which neural network architecture to use:

- Perceptron

- Multilayer Perceptron (MLP)/feedforward

- Convolutional Neural Networks (CNNs)

- Recurrent Neural Networks (RNNs)

- Long Short-Term Memory Networks (LSTMs)

- Gated Recurrent Units (GRUs)


Which activation functions to use in the network:

- Linear

- Sigmoid

- Tanh

- ReLU

- PReLU


What optimization algorithm to use to tune the parameters of the network:

- Stochastic Gradient Descent (SGD)

- RMSProp

- AdaGrad

- Adam

- AdaDelta

- Hessian-free optimization


How to initialize the parameters of the network:

- Random

- Xavier initialization

- He initialization

As you can appreciate, the products of these decisions can lead to a huge number of potential neural network variants, and one of the challenges of developing these models is determining the right search space within each of these choices. In the course of describing the history of neural networks, we will discuss the implications of each of these model parameters in more detail. Our overview of this field begins with the origin of the discipline: the humble perceptron model.


Summary

 In this chapter, we have covered an overview of what TensorFlow is and how it serves as an improvement over earlier frameworks for deep learning research.

We also explored setting up an IDE, VSCode, and the foundation of reproducible applications, Docker containers. To orchestrate and deploy Docker containers, we discussed the Kubernetes framework, and how we can scale groups of containers using its API. Finally, we described Kubeflow, a machine learning framework built on Kubernetes which allows us to run end-to-end pipelines, distributed training, and parameter search, and serve trained models. We then set up a Kubeflow deployment using Terraform, an infrastructure-as-code (IaC) tool.

Before jumping into specific projects, we will next cover the basics of neural network theory and the TensorFlow and Keras commands that you will need to write basic training jobs on Kubeflow.


Using Kubeflow Katib to optimize model hyperparameters

 Katib is a framework for running multiple instances of the same job with differing inputs, such as in neural architecture search (for determining the right number and size of layers in a neural network) and hyperparameter search (finding the right learning rate, for example, for an algorithm). Like the other Kustomize templates we have seen, the TensorFlow job specifies a generic TensorFlow job, with placeholders for the parameters:


    apiVersion: "kubeflow.org/v1alpha3"

    kind: Experiment

    metadata:

        namespcae: kubeflow

        name: tfjob-example

    spec:

        parallelTrialCount: 3

        maxTrialCount: 12

        maxFaildTrialCount: 3

        objective:

            type: maximize

            goal: 0.99

            objectiveMetricName: accuracy_1

        algorithm:

            glgorithmName: random

        metricsCollectorSpec:

            source:

                fileSystemPath:

                    path: /train

                    kind: Directory

                collector:

                    king: TensorFlowEvent

            parameters:

                -name: --learning_rate

                parameterType: double

                feasibleSpace:

                    min: "0.01"

                    max: "0.05"

                -name: --batch_size

                parameterType: int

                feasibleSpce:

                    min: "100"

                    max: "200"

            trialTemplate:

                goTemplate:

                    rowTemplate: | -

                        apiVersion: "kubeflow.ortg/v1"

                        kind: TFJob

                        metadata:

                            name: {{.Trial}}

                            namespcae: {{.NameSpcae}}

                        spec:

                            tfReplicas: 1

                            restartPolicy: OnFailure

                            template:

                                spec:

                                    containers:

                                        -name: tensorflow

                                        image: gcr.io/kubeflow-ci/tf-manist-with-summaries:1.0

                                        imagePullPolicy: Always

                                        command:

                                            -"python"

                                            -"/var/tf_mnist/mnist_with_summaries.py"

                                            -"--log_dir=/train/metrics"

                                            {{- with .HyperParameters}}

                                            {{- range .}}

                                            - "{{.Name}}-{{.Value}}"

                                            {{- end}}

                                            {{- end}}

We can run this using the familiar kubectl syntax:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml


or through the UI, where you can see a visual of the outcome of these multi-parameter experiments, or a table.



Kubeflow pipelines

 For notebook servers, we gave an example of a single-container (the notebook instance) application. Kubeflow also gives us the ability to run multi-container application workflows (such as input data, training, and deployment) using the pipelines functionality. Pipelines are Python functions that follow a Domain Specific Language (DSL) to specify components that will be compiled into containers.

If we click pipelines on the UI, we are brought to a dashboard.

Selecting one of these pipelines, we can see a visual overview of the component containers.


After creating a new run, we can specify parameters for a particular instance of this pipeline.

Once the pipeline is created, we can use the user interface to visualize the results.

Under the hood, the Python code to generate this pipeline is compiled using the pipelines SDK. We could specify the components to come either from a container with Python code:


    @kfp.dsl.component
    def my_component(my_param):
        ...
        return kfp.dsl.ContainerOp(
            name='My component name',
            image='gcr.io/path/to/container/image'
        )

or a function written in Python itself:

    @kfp.dsl.python_component(
        name='My awesome component',
        description='Come and play',
    )
    def my_python_func(a: str, b: str) -> str:
        ...


For a pure Python function, we could turn this into an operation with the compiler:

    my_op = compiler.build_python_component(
        component_func=my_python_func,
        staging_gcs_path=OUTPUT_DIR,
        target_image=TARGET_IMAGE)


We then use the dsl.pipeline decorator to add this operation to a pipeline:

    @kfp.dsl.pipeline(
        name='My pipeline',
        description='My machine learning pipeline'
    )
    def my_pipeline(param_1: PipelineParam, param_2: PipelineParam):
        my_step = my_op(a='a', b='b')


We compile it using the following code:

    kfp.compiler.Compiler().compile(my_pipeline, 'my-pipeline.zip')

and run it with this code:

    client = kfp.Client()
    my_experiment = client.create_experiment(name='demo')
    my_run = client.run_pipeline(my_experiment.id, 'my-pipeline', 'my-pipeline.zip')

We can also upload this ZIP file to the pipelines UI, where Kubeflow can use the YAML generated from compilation to instantiate the job.

Now that you have seen the process for generating results for a single pipeline, our next problem is how to generate the optimal parameters for such a pipeline. As you will see in Chapter 3, Building Blocks of Deep Neural Networks, neural network models typically have a number of choices regarding the model architecture (such as the number of layers, layer size, and connectivity) and the training paradigm (such as learning rate and optimizer algorithm). Kubeflow has a built-in utility for optimizing models over such parameter grids, called Katib.