
Saturday, February 26, 2022

Summary

 In this chapter, we have covered an overview of what TensorFlow is and how it serves as an improvement over earlier frameworks for deep learning research.

We also explored setting up an IDE, VSCode, and the foundation of reproducible applications, Docker containers. To orchestrate and deploy Docker containers, we discussed the Kubernetes framework, and how we can scale groups of containers using its API. Finally, I described Kubeflow, a machine learning framework built on Kubernetes which allows us to run end-to-end pipelines, distributed training, and parameter search, and serve trained models. We then set up a Kubeflow deployment using Terraform, an Infrastructure-as-Code (IaC) technology.

Before jumping into specific projects, we will next cover the basics of neural network theory and the TensorFlow and Keras commands that you will need to write basic training jobs on Kubeflow.


Using Kubeflow Katib to optimize model hyperparameters

 Katib is a framework for running multiple instances of the same job with differing inputs, such as in neural architecture search (for determining the right number and size of layers in a neural network) and hyperparameter search (finding the right learning rate, for example, for an algorithm). Like the other Kustomize templates we have seen, the Katib example specifies a generic TensorFlow job, with placeholders for the parameters:


    apiVersion: "kubeflow.org/v1alpha3"
    kind: Experiment
    metadata:
        namespace: kubeflow
        name: tfjob-example
    spec:
        parallelTrialCount: 3
        maxTrialCount: 12
        maxFailedTrialCount: 3
        objective:
            type: maximize
            goal: 0.99
            objectiveMetricName: accuracy_1
        algorithm:
            algorithmName: random
        metricsCollectorSpec:
            source:
                fileSystemPath:
                    path: /train
                    kind: Directory
            collector:
                kind: TensorFlowEvent
        parameters:
            - name: --learning_rate
              parameterType: double
              feasibleSpace:
                  min: "0.01"
                  max: "0.05"
            - name: --batch_size
              parameterType: int
              feasibleSpace:
                  min: "100"
                  max: "200"
        trialTemplate:
            goTemplate:
                rawTemplate: |-
                    apiVersion: "kubeflow.org/v1"
                    kind: TFJob
                    metadata:
                        name: {{.Trial}}
                        namespace: {{.NameSpace}}
                    spec:
                        tfReplicaSpecs:
                            Worker:
                                replicas: 1
                                restartPolicy: OnFailure
                                template:
                                    spec:
                                        containers:
                                            - name: tensorflow
                                              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                                              imagePullPolicy: Always
                                              command:
                                                  - "python"
                                                  - "/var/tf_mnist/mnist_with_summaries.py"
                                                  - "--log_dir=/train/metrics"
                                                  {{- with .HyperParameters}}
                                                  {{- range .}}
                                                  - "{{.Name}}={{.Value}}"
                                                  {{- end}}
                                                  {{- end}}

We can run this using the familiar kubectl syntax:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml


or through the UI, where you can see a visual representation of the outcome of these multi-parameter experiments, or a table.
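
For context, it helps to see what the trial container does with these placeholders: Katib substitutes each trial's sampled values as command-line flags, and the TensorFlowEvent collector parses the objective metric from TensorFlow event files under /train. Below is a minimal, hypothetical sketch of such a training script (the real example uses /var/tf_mnist/mnist_with_summaries.py), assuming a TensorFlow 2.x image:

    # Hypothetical Katib trial script; not the actual mnist_with_summaries.py.
    import argparse

    import tensorflow as tf

    # Katib passes each trial's hyperparameters as command-line flags.
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=100)
    parser.add_argument("--log_dir", type=str, default="/train/metrics")
    args = parser.parse_args()

    # ... build and train a model using args.learning_rate and args.batch_size ...

    # Write the objective metric (accuracy_1 in the Experiment spec) as a
    # TensorFlow summary so the TensorFlowEvent collector can parse it.
    writer = tf.summary.create_file_writer(args.log_dir)
    with writer.as_default():
        tf.summary.scalar("accuracy_1", 0.95, step=1)  # placeholder value
    writer.flush()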



Kubeflow pipelines

 For notebook servers, we gave an example of a single-container (the notebook instance) application. Kubeflow also gives us the ability to run multi-container application workflows (such as input data, training, and deployment) using the pipelines functionality. Pipelines are Python functions that follow a Domain-Specific Language (DSL) to specify components that will be compiled into containers.

If we click pipelines in the UI, we are brought to a dashboard.

Selecting one of these pipelines, we can see a visual overview of the component containers.


After creating a new run, we can specify parameters for a particular instance of this pipeline.

Once the pipeline is created, we can use the user interface to visualize the results.

Under the hood, the Python code to generate this pipeline is compiled using the pipelines SDK. We could specify the components to come either from a container with Python code:


@kfp.dsl.component
def my_component(my_param):
    ...
    return kfp.dsl.ContainerOp(
        name='My component name',
        image='gcr.io/path/to/container/image'
    )

or a function written in Python itself:

    @kfp.dsl.python_component(
        name='My awesome component',
        description='Come and play',
    )
    def my_python_func(a: str, b: str) -> str:
        return a + b  # placeholder body for illustration


For a pure Python function, we could turn this into an operation with the compiler:

my_op = compiler.build_python_component(
    component_func=my_python_func,
    staging_gcs_path=OUTPUT_DIR,
    target_image=TARGET_IMAGE)


We then use the dsl.pipeline decorator to add this operation to a pipeline:

    @kfp.dsl.pipeline(
        name='My pipeline',
        description='My machine learning pipeline'
    )
    def my_pipeline(param_1: PipelineParam, param_2: PipelineParam):
        my_step = my_op(a='a', b='b')


We compile it using the following code:

    kfp.compiler.Compiler().compile(my_pipeline, 'my-pipeline.zip')

and run it with this code:

    client = kfp.Client()
    my_experiment = client.create_experiment(name='demo')
    my_run = client.run_pipeline(my_experiment.id, 'my-pipeline', 'my-pipeline.zip')

We can also upload this ZIP file to the pipelines UI, where Kubeflow can use the YAML generated during compilation to instantiate the job.
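
To make the whole cycle concrete, here is a hedged, self-contained sketch assuming the v1 kfp SDK, with a trivial echo function as a hypothetical stand-in for real pipeline logic; func_to_container_op is a lighter-weight alternative to build_python_component that does not require building a container image:

    import kfp
    import kfp.components as comp


    def echo(message: str) -> str:
        # A trivial component that just returns its input.
        return message


    # Wrap the pure Python function as a container operation.
    echo_op = comp.func_to_container_op(echo)


    @kfp.dsl.pipeline(
        name='Echo pipeline',
        description='A minimal one-step pipeline'
    )
    def echo_pipeline(message: str = 'hello'):
        echo_task = echo_op(message)


    # Compile to an archive that can be uploaded through the pipelines UI
    # or submitted with kfp.Client(), as shown above.
    kfp.compiler.Compiler().compile(echo_pipeline, 'echo-pipeline.zip')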

Now that you have seen the process for generating results for a single pipeline, our next problem is how to generate the optimal parameters for such a pipeline. As you will see in Chapter 3, Building Blocks of Deep Neural Networks, neural network models typically have a number of configurations for their architecture (such as the number of layers, layer size, and connectivity) and training paradigm (such as learning rate and optimizer algorithm). Kubeflow has a built-in utility for optimizing models over such parameter grids, called Katib.

Kubeflow notebook servers

 We can use Kubeflow to start a Jupyter notebook server in a namespace, where we can run experimental code; we can start the notebook by clicking the Notebook Servers tab in the user interface and selecting NEW SERVER.

We can then specify parameters, such as which container to run (which could include the TensorFlow container we examined earlier in our discussion of Docker), and how many resources to allocate.


You can also specify a Persistent Volume (PV) to store data that remains even if the notebook server is turned off, and special resources such as GPUs.

Once started, if you have specified a container with TensorFlow resources, you can begin running models in the notebook server.
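
As a quick smoke test (a sketch, assuming the container image ships TensorFlow 2.x), you can verify the environment and fit a trivial model directly in a notebook cell:

    import numpy as np
    import tensorflow as tf

    # Confirm the TensorFlow version and whether any GPUs are visible.
    print(tf.__version__)
    print('GPUs visible:', tf.config.list_physical_devices('GPU'))

    # Fit a trivial model on random data to confirm training works end to end.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')

    x = np.random.rand(32, 4).astype('float32')
    y = np.random.rand(32, 1).astype('float32')
    model.fit(x, y, epochs=1)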

A brief tour of Kubeflow's components

 Now that we have installed Kubeflow locally or in the cloud, let us take a look again at the Kubeflow dashboard.

Let's walk through what is available in this toolkit. First, notice in the upper panel we have a dropdown with the name anonymous specified; this is the Kubernetes namespace referred to earlier. While our default is anonymous, we could create several namespaces on our Kubeflow instance to accommodate different users or projects. This can be done at login, where we set up a profile.

Alternatively, as with other operations in Kubernetes, we can apply a namespace using a YAML file:

apiVersion: kubeflow.org/v1beta1

kind: Profile

metadata:

    name: profileName

spec:

    owner:

        kind: User

        name: userid@email.com

Using the kubectl command:

kubectl create -f profile.yaml

What can we do once we have a namespace? Let us look through the available tools.

Installing Kubeflow using Terraform

 For each of these cloud providers, you'll probably notice that we have a common set of commands: creating a Kubernetes cluster, installing Kubeflow, and starting the application. While we can use scripts to automate this process, it would be desirable to have, like our code, a way to version control and persist different infrastructure configurations, allowing a reproducible recipe for creating the set of resources we need to run Kubeflow. It would also help us potentially move between cloud providers without completely rewriting our installation logic.

The template language Terraform (https://www.terraform.io/) was created by HashiCorp as a tool for Infrastructure as Code (IaC). In the same way that Kubernetes has an API to update resources on a cluster, Terraform allows us to abstract interactions with different underlying cloud providers through an API and a template language, using a command-line utility and core components written in Go (Figure 2.7). Terraform can be extended using user-written plugins.

Figure 2.7: Terraform architecture: Terraform Core communicates over RPC with plugins (providers and provisioners), which call upstream cloud APIs through their client libraries.

Let's look at one example of installing Kubeflow using Terraform instructions on AWS, located at https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow. Once you have established the required AWS resources and installed terraform on an EC2 instance, the aws-eks-cluster-and-nodegroup.tf Terraform file is used to create the Kubeflow cluster using the command:

terraform apply

In this file are a few key components. One is the set of variables that specify aspects of the deployment:

variable "efs_throughput_mode" {

    description = "EFS performance mode"

    default = "burstring"

    type = string

}

Another is the specification of which cloud provider we are using:

provider "aws" {

    region    =    var.region

    shared_credentials_file    = var.credentials 

    resrouce "aws_eks_cluster"    "eks_cluster" {

        name    =    var.cluster_name

        role_arn    =    aws_iam_role.cluster.role.arn

        version     =    var.k8s_version


    vpc_config {

        security_group_ids    =    [aws_security_group.cluster_sg.id]

        subnet_ids    =    flatten([aws_subnet.subnet.*.id])

    }

    depends_on    = [

        aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy,

        aws_iam_role_policy_attachment.cluster_AmazonKSServicePolicy,

    ]

    provisioner    "local-exec"       {

        command    =    "aws --region ${var.region} eks update-kubeconfig --name ${aws_eks_cluster.eks_cluster.name}"

    }

    provisioner    "local-exec"    {

        when    =    destroy

        command    =    "kubectl config unset current-context"

    }

}

    profile    =    var.profile

}

And another is resources such as the EKS cluster:

resource "aws_eks_cluster" "eks_cluster" {
    name = var.cluster_name
    role_arn = aws_iam_role.cluster_role.arn
    version = var.k8s_version

    vpc_config {
        security_group_ids = [aws_security_group.cluster_sg.id]
        subnet_ids = flatten([aws_subnet.subnet.*.id])
    }

    depends_on = [
        aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy,
        aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy,
    ]

    provisioner "local-exec" {
        command = "aws --region ${var.region} eks update-kubeconfig --name ${aws_eks_cluster.eks_cluster.name}"
    }

    provisioner "local-exec" {
        when = destroy
        command = "kubectl config unset current-context"
    }
}

Every time we run the terraform apply command, Terraform walks through this file to determine what resources to create, which underlying AWS services to call to create them, and which set of configurations they should be provisioned with. This provides a clean way to orchestrate complex installations such as Kubeflow in a versioned, extensible template language.

Now that we have successfully installed Kubeflow either locally or on a managed Kubernetes control plane in the cloud, let us take a look at what tools are available on the platform.


                            

Friday, February 25, 2022

Installing Kubeflow on Azure

 Azure is Microsoft Corporation's cloud offering, and like AWS and GCP, we can use it to install Kubeflow leveraging a Kubernetes control plane and computing resources residing in the Azure cloud.

1. Register an account on Azure

Sign up at https://azure.microsoft.com - a free tier is available for experimentation.

2. Install the Azure command-line utilities

See instructions for installation on your platform at https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest. You can verify your installation by running the following on the command line on your machine:

az

This should print a list of commands that you can use on the console. To start, log in to your account with:

az login

And enter the account credentials you registered in Step 1. You will be redirected to a browser to verify your account, after which you should see a response like the following:

"You have logged in. Now let us find all the subscription to which you have access": -

[

{

    "cloudName": ...

    "id"...,

...

    "user": {

...

}

}

]

3. Create the resource group for a new cluster

We first need to create the resource group where our new application will live, using the following command:

az group create -n ${RESOURCE_GROUP_NAME} -l ${LOCATION}

4. Create a Kubernetes resource on AKS

Now deploy the Kubernetes control plane on your resource group:

az aks create -g ${RESOURCE_GROUP_NAME} -n ${NAME} -s ${AGENT_SIZE} -c ${AGENT_COUNT} -l ${LOCATION} --generate-ssh-keys

5. Install Kubeflow

First, we need to obtain credentials to install Kubeflow on our AKS resource:

az aks get-credentials -n ${NAME} -g ${RESOURCE_GROUP_NAME}

6. Install kfctl

Download the kfctl tarball and unpack it:

tar -xvf kfctl_v0.7.1_<platform>.tar.gz

7. Set environment variables

As with AWS, we need to enter values for a few key environment variables:

the directory containing the Kubeflow configuration files (${KF_DIR}), the name of the Kubeflow deployment (${KF_NAME}), and the path to the base configuration URI (${CONFIG_URI}); for Azure, this is https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.0.1.yaml.

8. Launch Kubeflow

As with AWS, we use Kustomize to build the template file and launch Kubeflow:

mkdir -p ${KF_DIR}

cd ${KF_DIR}

kfctl apply -V -f ${CONFIG_URI}

Once Kubeflow is launched, you can use port forwarding to redirect traffic from local port 8080 to port 80 in the cluster, letting you access the Kubeflow dashboard at localhost:8080, using the following command:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80