First Look at Kubeflow


Background

Set up a Kubeflow experimental environment on Alibaba Cloud ECS.

Hardware requirements

Minimum worker hardware

See the official documentation:

  • 4 CPUs
  • 50 GB storage
  • 12 GB memory

Software requirements

k8s / Kubeflow version compatibility

Partially excerpted from the official documentation:

  Kubernetes Version   Kubeflow 1.0      Kubeflow 1.1
  1.16                 compatible        compatible
  1.17                 no known issues   no known issues
  1.18                 no known issues   no known issues

This article uses k8s 1.18 and Kubeflow 1.1.

Initial k8s environment

% kubectl get nodes
NAME                       STATUS   ROLES    AGE   VERSION
cn-hangzhou.192.168.2.63   Ready    <none>   18m   v1.18.8-aliyun.1
cn-hangzhou.192.168.2.64   Ready    <none>   18m   v1.18.8-aliyun.1
cn-hangzhou.192.168.2.65   Ready    <none>   18m   v1.18.8-aliyun.1
% kubectl get pods --all-namespaces
NAMESPACE     NAME                                                  READY   STATUS      RESTARTS   AGE
arms-prom     ack-prometheus-gpu-exporter-48wxq                     1/1     Running     0          14m
arms-prom     ack-prometheus-gpu-exporter-7tvlw                     1/1     Running     0          14m
arms-prom     ack-prometheus-gpu-exporter-n4h7q                     1/1     Running     0          14m
arms-prom     arms-prometheus-ack-arms-prometheus-88dbcf655-5jtvx   1/1     Running     0          14m
arms-prom     kube-state-metrics-bbb5b6f79-55pvl                    1/1     Running     0          14m
arms-prom     node-exporter-8sh74                                   2/2     Running     0          14m
arms-prom     node-exporter-kgpzx                                   2/2     Running     0          14m
arms-prom     node-exporter-zsf58                                   2/2     Running     0          14m
kube-system   ack-node-problem-detector-daemonset-4t44f             1/1     Running     0          14m
kube-system   ack-node-problem-detector-daemonset-9stt5             1/1     Running     0          14m
kube-system   ack-node-problem-detector-daemonset-mvbdg             1/1     Running     0          14m
kube-system   ack-node-problem-detector-eventer-684f5f9b86-wr5fc    1/1     Running     0          14m
kube-system   alibaba-log-controller-5bff6dcfd4-m48gd               1/1     Running     0          14m
kube-system   alicloud-application-controller-798784bf49-sqpnd      1/1     Running     0          14m
kube-system   alicloud-monitor-controller-7ff6c85c56-cz8d4          1/1     Running     0          14m
kube-system   aliyun-acr-credential-helper-59b6d6c858-wtd92         1/1     Running     0          14m
kube-system   coredns-78d4b8bd88-c98zc                              1/1     Running     0          14m
kube-system   coredns-78d4b8bd88-wfk2w                              1/1     Running     0          14m
kube-system   csi-plugin-2jh8c                                      4/4     Running     0          14m
kube-system   csi-plugin-gzmhg                                      4/4     Running     0          14m
kube-system   csi-plugin-prw6b                                      4/4     Running     0          14m
kube-system   csi-provisioner-5cbb9458b6-b66l4                      7/7     Running     0          14m
kube-system   csi-provisioner-5cbb9458b6-pvhb6                      7/7     Running     1          14m
kube-system   kube-eventer-init-w2cqt                               0/1     Completed   0          14m
kube-system   kube-flannel-ds-9nx9v                                 1/1     Running     0          14m
kube-system   kube-flannel-ds-cfb8d                                 1/1     Running     0          14m
kube-system   kube-flannel-ds-db2sj                                 1/1     Running     0          14m
kube-system   kube-proxy-worker-5j92s                               1/1     Running     0          14m
kube-system   kube-proxy-worker-d7ffv                               1/1     Running     0          14m
kube-system   kube-proxy-worker-ngv57                               1/1     Running     0          14m
kube-system   logtail-ds-cfcgk                                      1/1     Running     0          14m
kube-system   logtail-ds-kj9nr                                      1/1     Running     0          14m
kube-system   logtail-ds-tt4mm                                      1/1     Running     0          14m
kube-system   metrics-server-6446b68f74-zp28l                       1/1     Running     0          14m
kube-system   nginx-ingress-controller-548755f5d4-k8d48             1/1     Running     0          14m
kube-system   nginx-ingress-controller-548755f5d4-mc8m4             1/1     Running     0          14m
kube-system   nvidia-device-plugin-cn-hangzhou.192.168.2.63         1/1     Running     0          16m
kube-system   nvidia-device-plugin-cn-hangzhou.192.168.2.64         1/1     Running     0          16m
kube-system   nvidia-device-plugin-cn-hangzhou.192.168.2.65         1/1     Running     0          16m

Installing Kubeflow

Download kfctl

Download kfctl 1.1.0 from the official site.

Install Kubeflow 1.0

This article reuses the quick-install approach from the post "Kubeflow 1.0 上线: 体验生产级的机器学习平台" (Kubeflow 1.0 released: a production-grade machine learning platform).

Initialize the environment

export KF_NAME=my-kubeflow
export BASE_DIR=/root
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="http://kubeflow.oss-cn-beijing.aliyuncs.com/kfctl_k8s_istio.v1.0.1.yaml"
mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl build -V -f ${CONFIG_URI}
export CONFIG=${KF_DIR}/kfctl_k8s_istio.v1.0.1.yaml

Create PVs for the services

cat << EOF > local_pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-mysql-pv
  labels:
    type: local
    app: pipeline-mysql-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/pipeline-mysql
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-minio-pv
  labels:
    type: local
    app: pipeline-minio-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/pipeline-minio
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib-mysql
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib-mysql
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: metadata-mysql-pv
  labels:
    type: local
    app: metadata-mysql-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/metadata-mysql
    type: DirectoryOrCreate
EOF

Apply the manifest

kubectl create -f local_pv.yaml

Deploy the services

kfctl apply -V -f ${CONFIG}

Set up routing

In the Alibaba Cloud console, add a route that maps a domain name to port 80 of the istio-ingressgateway service in the istio-system namespace.
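The console route corresponds to a standard Ingress resource. A minimal sketch of the equivalent YAML (the name and host below are placeholders; on k8s 1.18 the Ingress API is still networking.k8s.io/v1beta1):

```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: kubeflow-ingress          # hypothetical name
  namespace: istio-system
spec:
  rules:
  - host: kubeflow.example.com    # placeholder domain
    http:
      paths:
      - backend:
          serviceName: istio-ingressgateway
          servicePort: 80
```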

Access from a browser

Open the domain name configured in the routing step in a browser; the Kubeflow welcome page should appear.

Kubeflow usage examples

On the Kubeflow welcome page, create a namespace of your own, then enter the dashboard.

MNIST getting-started example

Based on the Kubeflow MNIST example.

Create a GPU Jupyter notebook

Under Notebook Servers, create a Notebook Server.

If you see the prompt No default Storage Class is set. Can't create new Disks for the new Notebook. Please use an Existing Disk., run:

% kubectl patch storageclass alicloud-disk-ssd -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
storageclass.storage.k8s.io/alicloud-disk-ssd patched

This sets alicloud-disk-ssd as the default storage class.

Alibaba Cloud provides the following images; this example uses the second one.

registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0

If you configure a Workspace Volume with alicloud-disk-ssd as the default storage class, set Size to 20G: that is the minimum size of an alicloud disk.

With a Size below 20G, or without a PV at all, the image will not start.

Check the GPU

Click CONNECT, log in to a Terminal, and check the GPU environment:

tf-docker ~ > nvidia-smi
Wed Nov 18 11:43:25 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   31C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

MNIST training

Create a Python file with the following example code:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Download MNIST and load it with one-hot labels
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Softmax regression: one linear layer over the flattened 28x28 image
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

# Cross-entropy loss against the one-hot labels
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# 1000 steps of mini-batch SGD, batch size 100
for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# Evaluate accuracy on the test set
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Accuracy: ", sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

Run it.

The training result:

Accuracy:  0.9009
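The graph above is plain softmax regression. As a sanity check on what it computes, here is a hypothetical NumPy sketch of the same forward pass, loss, and one GradientDescentOptimizer(0.05)-style update, with random data standing in for MNIST (not part of the Kubeflow example):

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 784)).astype(np.float32)        # fake "images"
y = np.eye(10, dtype=np.float32)[rng.integers(0, 10, 100)]  # one-hot labels

W = np.zeros((784, 10), dtype=np.float32)
b = np.zeros(10, dtype=np.float32)

# Forward pass: with zero weights every class gets probability 0.1
y_hat = softmax(x @ W + b)
loss = -np.mean(np.sum(y * np.log(y_hat), axis=1))  # cross-entropy, ln(10) at init

# One SGD step: the gradient of the loss w.r.t. the logits is (y_hat - y)/N
grad_logits = (y_hat - y) / len(x)
W -= 0.05 * (x.T @ grad_logits)
b -= 0.05 * grad_logits.sum(axis=0)
```

Iterating that update over real MNIST batches is exactly what the TF session loop above does.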

Full MNIST example

  1. In a Terminal inside the Jupyter notebook, run:
git clone https://gitee.com/flowaters/kubeflow_examples.git git_kubeflow-examples
  2. Open mnist/mnist_vanilla_k8s.ipynb and follow the instructions there.

Financial time series

FAQ

Deleting the cluster fails with a cluster name mismatch

kfctl delete -f ./kfctl_k8s_istio.v1.0.1.yaml

If you encounter:

cluster name doesn't match: KfDef(current-context) v.s. current-context(kubernetes)

modify the beginning of the local kfctl_k8s_istio.v1.0.1.yaml file as follows, adding a clusterName field:

metadata:
  annotations:
    kfctl.kubeflow.io/force-delete: "false"
  clusterName: kubernetes
  creationTimestamp: null
  namespace: kubeflow

knative-install fails when creating the cluster

The error message:

Encountered error applying application knative-install:  (kubeflow.error): Code 500 with message: Apply.Run : error when creating "/tmp/kout055481745": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post https://webhook.knative-serving.svc:443/config-validation?timeout=30s: no endpoints available for service "webhook"  filename="kustomize/kustomize.go:284"

Solution:

kubectl delete validatingwebhookconfigurations config.webhook.serving.knative.dev validation.webhook.serving.knative.dev
kubectl delete mutatingwebhookconfigurations webhook.serving.knative.dev

API documentation

References