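First Experience with Kubeflow
Background
Set up an experimental Kubeflow environment on Alibaba Cloud ECS.
Hardware configuration
Minimum worker hardware
See the official website: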
- 4 CPU
- 50 GB storage
- 12 GB memory
Software configuration
Compatibility between k8s and Kubeflow versions
Excerpted in part from the official website:
Kubernetes Versions | Kubeflow 1.0 | Kubeflow 1.1 |
---|---|---|
1.16 | compatible | compatible |
1.17 | no known issues | no known issues |
1.18 | no known issues | no known issues |
This article uses k8s 1.18 and Kubeflow 1.1.
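The table lookup can be sketched in a few lines of Python. This is only an illustration: the dictionary copies the rows excerpted above, and the helper names `minor_version` and `check_compatibility` are hypothetical, not part of any Kubeflow tool.

```python
# Compatibility table copied from the excerpt above.
# Keys are (Kubernetes minor version, Kubeflow minor version).
COMPATIBILITY = {
    ("1.16", "1.0"): "compatible",
    ("1.16", "1.1"): "compatible",
    ("1.17", "1.0"): "no known issues",
    ("1.17", "1.1"): "no known issues",
    ("1.18", "1.0"): "no known issues",
    ("1.18", "1.1"): "no known issues",
}


def minor_version(version):
    """Reduce a string such as 'v1.18.8-aliyun.1' to '1.18'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return f"{major}.{minor}"


def check_compatibility(k8s_version, kubeflow_version):
    """Return the table entry, or 'not listed' for pairs outside the excerpt."""
    key = (minor_version(k8s_version), minor_version(kubeflow_version))
    return COMPATIBILITY.get(key, "not listed")


print(check_compatibility("v1.18.8-aliyun.1", "1.1"))
```

The version string passed in has the same shape as the VERSION column reported by `kubectl get nodes`, so the check can be fed directly from cluster output.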
Initial k8s environment
% kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-hangzhou.192.168.2.63 Ready <none> 18m v1.18.8-aliyun.1
cn-hangzhou.192.168.2.64 Ready <none> 18m v1.18.8-aliyun.1
cn-hangzhou.192.168.2.65 Ready <none> 18m v1.18.8-aliyun.1
% kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
arms-prom ack-prometheus-gpu-exporter-48wxq 1/1 Running 0 14m
arms-prom ack-prometheus-gpu-exporter-7tvlw 1/1 Running 0 14m
arms-prom ack-prometheus-gpu-exporter-n4h7q 1/1 Running 0 14m
arms-prom arms-prometheus-ack-arms-prometheus-88dbcf655-5jtvx 1/1 Running 0 14m
arms-prom kube-state-metrics-bbb5b6f79-55pvl 1/1 Running 0 14m
arms-prom node-exporter-8sh74 2/2 Running 0 14m
arms-prom node-exporter-kgpzx 2/2 Running 0 14m
arms-prom node-exporter-zsf58 2/2 Running 0 14m
kube-system ack-node-problem-detector-daemonset-4t44f 1/1 Running 0 14m
kube-system ack-node-problem-detector-daemonset-9stt5 1/1 Running 0 14m
kube-system ack-node-problem-detector-daemonset-mvbdg 1/1 Running 0 14m
kube-system ack-node-problem-detector-eventer-684f5f9b86-wr5fc 1/1 Running 0 14m
kube-system alibaba-log-controller-5bff6dcfd4-m48gd 1/1 Running 0 14m
kube-system alicloud-application-controller-798784bf49-sqpnd 1/1 Running 0 14m
kube-system alicloud-monitor-controller-7ff6c85c56-cz8d4 1/1 Running 0 14m
kube-system aliyun-acr-credential-helper-59b6d6c858-wtd92 1/1 Running 0 14m
kube-system coredns-78d4b8bd88-c98zc 1/1 Running 0 14m
kube-system coredns-78d4b8bd88-wfk2w 1/1 Running 0 14m
kube-system csi-plugin-2jh8c 4/4 Running 0 14m
kube-system csi-plugin-gzmhg 4/4 Running 0 14m
kube-system csi-plugin-prw6b 4/4 Running 0 14m
kube-system csi-provisioner-5cbb9458b6-b66l4 7/7 Running 0 14m
kube-system csi-provisioner-5cbb9458b6-pvhb6 7/7 Running 1 14m
kube-system kube-eventer-init-w2cqt 0/1 Completed 0 14m
kube-system kube-flannel-ds-9nx9v 1/1 Running 0 14m
kube-system kube-flannel-ds-cfb8d 1/1 Running 0 14m
kube-system kube-flannel-ds-db2sj 1/1 Running 0 14m
kube-system kube-proxy-worker-5j92s 1/1 Running 0 14m
kube-system kube-proxy-worker-d7ffv 1/1 Running 0 14m
kube-system kube-proxy-worker-ngv57 1/1 Running 0 14m
kube-system logtail-ds-cfcgk 1/1 Running 0 14m
kube-system logtail-ds-kj9nr 1/1 Running 0 14m
kube-system logtail-ds-tt4mm 1/1 Running 0 14m
kube-system metrics-server-6446b68f74-zp28l 1/1 Running 0 14m
kube-system nginx-ingress-controller-548755f5d4-k8d48 1/1 Running 0 14m
kube-system nginx-ingress-controller-548755f5d4-mc8m4 1/1 Running 0 14m
kube-system nvidia-device-plugin-cn-hangzhou.192.168.2.63 1/1 Running 0 16m
kube-system nvidia-device-plugin-cn-hangzhou.192.168.2.64 1/1 Running 0 16m
kube-system nvidia-device-plugin-cn-hangzhou.192.168.2.65 1/1 Running 0 16m
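Before installing anything, it can help to confirm programmatically that every node is Ready. The sketch below parses the plain-text output of `kubectl get nodes`; the function names are hypothetical, and the simple whitespace split assumes that no column value contains a space.

```python
def parse_nodes(output):
    """Parse `kubectl get nodes` text into a list of dicts keyed by column name."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def not_ready(nodes):
    """Return the names of nodes whose STATUS is not exactly 'Ready'."""
    return [n["NAME"] for n in nodes if n["STATUS"] != "Ready"]


# Sample copied from the node listing above.
SAMPLE = """\
NAME STATUS ROLES AGE VERSION
cn-hangzhou.192.168.2.63 Ready <none> 18m v1.18.8-aliyun.1
cn-hangzhou.192.168.2.64 Ready <none> 18m v1.18.8-aliyun.1
cn-hangzhou.192.168.2.65 Ready <none> 18m v1.18.8-aliyun.1
"""

nodes = parse_nodes(SAMPLE)
print(len(nodes), "nodes, not ready:", not_ready(nodes))
```

On a live cluster the same functions could be fed from `subprocess.run(["kubectl", "get", "nodes"], capture_output=True, text=True).stdout`.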
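Installing Kubeflow
Download kfctl
Download kfctl version 1.1.0 from the official website.
Install Kubeflow 1.0
This article follows the method described in the post "Kubeflow 1.0 Goes Live: Experiencing a Production-Grade Machine Learning Platform" to install the environment quickly.
Initialize the environment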
export KF_NAME=my-kubeflow
export BASE_DIR=/root
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="http://kubeflow.oss-cn-beijing.aliyuncs.com/kfctl_k8s_istio.v1.0.1.yaml"
mkdir -p ${KF_DIR}
cd ${KF_DIR}
# Fetch the config and generate the deployment files under ${KF_DIR}
kfctl build -V -f ${CONFIG_URI}
export CONFIG=${KF_DIR}/kfctl_k8s_istio.v1.0.1.yaml
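The last line of the script repeats the file name that already appears in CONFIG_URI, so the two can drift apart when the URI changes. As a small sketch, the local path can be derived from the URI instead; `local_config_path` is a hypothetical helper, and it assumes, as the script does, that the config file ends up in KF_DIR under its original name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_config_path(base_dir, kf_name, config_uri):
    """Return the path where the config file is expected inside KF_DIR."""
    kf_dir = PurePosixPath(base_dir) / kf_name
    # The last component of the URI path is the file name.
    file_name = PurePosixPath(urlparse(config_uri).path).name
    return str(kf_dir / file_name)


print(local_config_path(
    "/root",
    "my-kubeflow",
    "http://kubeflow.oss-cn-beijing.aliyuncs.com/kfctl_k8s_istio.v1.0.1.yaml",
))
```

This prints `/root/my-kubeflow/kfctl_k8s_istio.v1.0.1.yaml`. A trailing slash on the base directory is harmless because the path type normalizes it.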
Create PVs for the services
cat << EOF > local_pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-mysql-pv
  namespace: kubeflow
  labels:
    type: local
    app: pipeline-mysql-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/pipeline-mysql
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-minio-pv
  namespace: kubeflow
  labels:
    type: local
    app: pipeline-minio-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/pipeline-minio
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  namespace: kubeflow
  labels:
    type: local
    app: katib-mysql
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib-mysql
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: metadata-mysql-pv
  namespace: kubeflow
  labels:
    type: local
    app: metadata-mysql-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/metadata-mysql
    type: DirectoryOrCreate
EOF
Run the command:
kubectl create -f local_pv.yaml
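The four PersistentVolume definitions differ only in name, host path, and whether they carry the `key: kubeflow-pv` label. As an alternative to writing them by hand, the manifest could be rendered from a template. This is a sketch, not the original author's method; `render_pv` and `render_all` are hypothetical names, and the values are copied from local_pv.yaml.

```python
PV_TEMPLATE = """\
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {name}
  namespace: kubeflow
  labels:
{labels}
spec:
  capacity:
    storage: {size}
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: {path}
    type: DirectoryOrCreate
"""

KEY = {"key": "kubeflow-pv"}

# (PV name, host directory, extra labels) as in local_pv.yaml.
VOLUMES = [
    ("pipeline-mysql-pv", "/data/pipeline-mysql", KEY),
    ("pipeline-minio-pv", "/data/pipeline-minio", KEY),
    ("katib-mysql", "/data/katib-mysql", {}),
    ("metadata-mysql-pv", "/data/metadata-mysql", KEY),
]


def render_pv(name, path, extra_labels, size="20Gi"):
    """Render a single PersistentVolume document."""
    labels = {"type": "local", "app": name, **extra_labels}
    label_lines = "\n".join(f"    {k}: {v}" for k, v in labels.items())
    return PV_TEMPLATE.format(name=name, path=path, size=size, labels=label_lines)


def render_all(volumes):
    """Join the documents with YAML document separators."""
    return "---\n".join(render_pv(*v) for v in volumes)


manifest = render_all(VOLUMES)
print(manifest)
```

Redirecting the printed output to local_pv.yaml should give a file equivalent to the hand-written one, which can then be passed to `kubectl create -f` as before.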
Deploy the services
kfctl apply -V -f ${CONFIG}
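Configure routing
In the Alibaba Cloud console, add a route that maps a domain name to port 80 of istio-ingressgateway in the istio-system namespace.
Access from a browser
Open the domain name configured in the routing step in a browser; the Kubeflow welcome page should appear.
Kubeflow usage examples
On the Kubeflow welcome page, create your own namespace, then enter the console.
MNIST getting-started example
Create a GPU Jupyter notebook
In Notebook Servers, create a Notebook Server.
If the message "No default Storage Class is set. Can't create new Disks for the new Notebook. Please use an Existing Disk." appears, run: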
% kubectl patch storageclass alicloud-disk-ssd -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
storageclass.storage.k8s.io/alicloud-disk-ssd patched
This sets alicloud-disk-ssd as the default storage class.
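The patch body is a small JSON document that sets one annotation on the StorageClass. The sketch below builds that body and the matching command line with the standard library, which avoids hand-escaping the nested quotes; `default_class_patch` and `patch_command` are hypothetical helper names.

```python
import json
import shlex

ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_class_patch(is_default=True):
    """Build the JSON patch body that marks a StorageClass as default (or not)."""
    value = "true" if is_default else "false"
    return json.dumps({"metadata": {"annotations": {ANNOTATION: value}}})


def patch_command(storage_class, is_default=True):
    """Return the kubectl command line as one shell-quoted string."""
    args = ["kubectl", "patch", "storageclass", storage_class,
            "-p", default_class_patch(is_default)]
    return " ".join(shlex.quote(a) for a in args)


print(patch_command("alicloud-disk-ssd"))
```

Calling `patch_command(name, is_default=False)` yields the command that removes the default flag again, which may be useful if another class in the cluster is already marked as default.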
Alibaba Cloud provides the following images; this example uses the second one.
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0
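If you configure a Workspace Volume and alicloud-disk-ssd is the default storage class, set Size to 20G because of the minimum disk size limit.
If Size is less than 20G, or if no PV is used, the notebook image fails to start.
Check GPU information
Click CONNECT, open a Terminal, and check the GPU environment: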
tf-docker ~ > nvidia-smi
Wed Nov 18 11:43:25 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:07.0 Off | 0 |
| N/A 31C P0 39W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
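The table layout is convenient to read but awkward to process in a script. `nvidia-smi` also offers a CSV query mode, and the sketch below parses that form. The helper names are hypothetical, the sample row is illustrative and simply reuses the memory and utilization values from the table above, and the parser assumes every queried field is numeric.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.total,memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_gpu_csv(text):
    """Turn CSV rows into dicts with integer fields (MiB and percent)."""
    gpus = []
    for line in text.strip().splitlines():
        index, total, used, util = (int(field) for field in line.split(","))
        gpus.append({"index": index, "total_mib": total,
                     "used_mib": used, "util_pct": util})
    return gpus


def query_gpus():
    """Run nvidia-smi and parse its output; needs the NVIDIA driver tools."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Illustrative row with the same values as the table above.
print(parse_gpu_csv("0, 16130, 0, 0\n"))
```

Inside the notebook Terminal, `query_gpus()` would return one entry per visible GPU; an empty list or an exception suggests the GPU is not exposed to the container.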
MNIST training
Create a Python file with the following sample code:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Download (if needed) and load MNIST with one-hot labels.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Softmax regression: y = softmax(xW + b).
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 10])

# Cross-entropy loss, averaged over the batch.
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# Train on 1000 mini-batches of 100 images each.
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# Evaluate accuracy on the test set.
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Accuracy: ", sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
Run it.
The training result is:
Accuracy: 0.9009
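To make the loss in the script concrete, here is the same softmax and cross-entropy computation for a single sample in plain Python, without TensorFlow. It is a worked illustration only; the training script averages this quantity over each batch of 100 images.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """Compute -sum(y_ * log(y)) for one sample."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


# With W and b initialized to zero, every logit is 0, so each of the
# 10 classes gets probability 0.1 and the starting loss is ln(10).
probs = softmax([0.0] * 10)
label = [0.0] * 10
label[3] = 1.0
loss = cross_entropy(label, probs)
print(round(loss, 4))
```

The printed value is 2.3026, the natural logarithm of 10. Gradient descent lowers the loss from this starting point as the weights move away from zero.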
Full MNIST example
- Open a Terminal in the Jupyter Notebook and run:
git clone https://gitee.com/flowaters/kubeflow_examples.git git_kubeflow-examples
- Open mnist/mnist_vanilla_k8s.ipynb and follow its instructions.
Financial time series
FAQ
Cluster name mismatch error when deleting the cluster
kfctl delete -f ./kfctl_k8s_istio.v1.0.1.yaml
If you encounter
cluster name doesn't match: KfDef(current-context) v.s. current-context(kubernetes)
edit the beginning of the local kfctl_k8s_istio.v1.0.1.yaml file as follows, adding the clusterName field:
metadata:
  annotations:
    kfctl.kubeflow.io/force-delete: "false"
  clusterName: kubernetes
  creationTimestamp: null
  namespace: kubeflow
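The same edit can be scripted. The sketch below works on the file's text line by line instead of using a YAML library, so it only suits files laid out like the fragment above; `add_cluster_name` is a hypothetical helper, and the value passed in should be the cluster name reported in the error message.

```python
def add_cluster_name(kfdef_text, cluster_name):
    """Insert `clusterName` under the top-level `metadata:` key if it is missing."""
    lines = kfdef_text.splitlines()
    if any(line.strip().startswith("clusterName:") for line in lines):
        return kfdef_text  # already present, leave the text untouched
    for i, line in enumerate(lines):
        if line.rstrip() == "metadata:":  # top-level key, no indentation
            lines.insert(i + 1, f"  clusterName: {cluster_name}")
            break
    return "\n".join(lines) + "\n"


# Fragment of the config file before the fix.
SAMPLE = """\
metadata:
  annotations:
    kfctl.kubeflow.io/force-delete: "false"
  creationTimestamp: null
  namespace: kubeflow
"""

print(add_cluster_name(SAMPLE, "kubernetes"))
```

The function is idempotent: running it on a file that already has a clusterName line returns the text unchanged. Key order inside a YAML mapping does not matter, so placing the new line directly under `metadata:` is equivalent to the hand-edited version.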
knative-install fails when creating the cluster
The error message is:
Encountered error applying application knative-install: (kubeflow.error): Code 500 with message: Apply.Run : error when creating "/tmp/kout055481745": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post https://webhook.knative-serving.svc:443/config-validation?timeout=30s: no endpoints available for service "webhook" filename="kustomize/kustomize.go:284"
The fix is:
kubectl delete validatingwebhookconfigurations config.webhook.serving.knative.dev validation.webhook.serving.knative.dev
kubectl delete mutatingwebhookconfigurations webhook.serving.knative.dev
API documentation
- Kubeflow reference docs for guides to the Kubeflow Metadata API and SDK, the PyTorchJob CRD, and the TFJob CRD.
- Pipelines reference docs for the Kubeflow Pipelines API and SDK, including the Kubeflow Pipelines domain-specific language (DSL).
- Fairing reference docs for the Kubeflow Fairing SDK.