Katib is a framework for running multiple instances of the same job with differing inputs, as in neural architecture search (determining the right number and size of layers in a neural network) and hyperparameter search (finding, for example, the right learning rate for an algorithm). Like the other Kustomize templates we have seen, the example below specifies a generic TensorFlow job, with placeholders for the parameters:
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  # Run at most 3 trials at once and 12 in total; give up after 3 failures.
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  # Stop early once a trial reports accuracy_1 of at least 0.99.
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  # Read metrics from the TensorFlow event files written under /train.
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  # Search space: each entry becomes one command-line flag of the trial.
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "/var/tf_mnist/mnist_with_summaries.py"
                        - "--log_dir=/train/metrics"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
We can run this experiment using the familiar kubectl syntax:
kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml
or through the UI, where you can view the outcome of these multi-parameter experiments either as a visualization or as a table.
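To make the roles of the parameters section and the trial template more concrete, the following is a minimal Python sketch of what a random search over this experiment amounts to: draw a value for each parameter from its feasible space, then append one flag per hyperparameter to the training command. It is only an illustration under stated assumptions, not Katib's implementation. The helper names (sample_trial, render_command, run_experiment) are hypothetical, and Katib's own random algorithm runs inside the cluster and may differ in detail.

```python
import random

# Search space copied from the Experiment's `parameters` section.
SEARCH_SPACE = [
    {"name": "--learning_rate", "type": "double", "min": "0.01", "max": "0.05"},
    {"name": "--batch_size", "type": "int", "min": "100", "max": "200"},
]

# Fixed part of the container command from the trial template.
BASE_COMMAND = [
    "python",
    "/var/tf_mnist/mnist_with_summaries.py",
    "--log_dir=/train/metrics",
]


def sample_trial(space, rng):
    """Draw one value per parameter uniformly from its feasible space (hypothetical helper)."""
    assignment = {}
    for param in space:
        if param["type"] == "double":
            value = rng.uniform(float(param["min"]), float(param["max"]))
        else:  # treat everything else as "int" in this sketch
            value = rng.randint(int(param["min"]), int(param["max"]))
        assignment[param["name"]] = value
    return assignment


def render_command(assignment):
    """Mimic the template loop: one "name=value" argument per hyperparameter."""
    return BASE_COMMAND + [f"{name}={value}" for name, value in assignment.items()]


def run_experiment(max_trial_count=12, seed=0):
    """Generate the commands for max_trial_count independent random trials."""
    rng = random.Random(seed)
    return [
        render_command(sample_trial(SEARCH_SPACE, rng))
        for _ in range(max_trial_count)
    ]


if __name__ == "__main__":
    for command in run_experiment():
        print(" ".join(command))
```

Each printed line corresponds to the command one trial's TFJob container would run, with the sampled learning rate and batch size appended as flags. The sketch ignores parallelism, failure counting, and the early stop on the objective goal, all of which the real experiment controller handles.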