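Installing the NVIDIA Driver on an Alibaba Cloud ECS GPU Instance
Background
The driver places requirements on both the operating system version and the driver version. This article tested only one specific combination of versions.
First Attempt
Installing the System
Use CentOS 7.2. The corresponding Alibaba Cloud image name is: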
centos_7_02_64_20G_alibase_20170818.vhd
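I first tried Alibaba Cloud's own image but could not get it to work, so I switched to the image above.
After logging in, install and upgrade the required packages: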
yum install kernel-devel kernel-doc kernel-headers gcc dkms -y
yum upgrade
After the upgrade, the kernel version is:
# uname -r
3.10.0-1160.6.1.el7.x86_64
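The driver build needs kernel sources that match the running kernel. If yum upgrade installed a newer kernel, uname -r only reports it after a reboot. A minimal sketch of checking for that match, assuming kernel-devel places its sources under /usr/src/kernels/<release> (the path used later in this article); the function names are hypothetical.

```python
import os
import platform


def kernel_source_path(release, root="/usr/src/kernels"):
    """Return the directory where sources for `release` are expected."""
    return os.path.join(root, release)


def has_matching_kernel_sources(release=None, root="/usr/src/kernels"):
    """True if a source directory exists for the running (or given) kernel."""
    if release is None:
        release = platform.release()  # same string as `uname -r`
    return os.path.isdir(kernel_source_path(release, root))
```

If has_matching_kernel_sources() returns False, a reboot into the upgraded kernel, or installing the kernel-devel package for the running release, is the likely fix.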
Installing the Driver
Download the driver:
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/418.43/NVIDIA-Linux-x86_64-418.43.run
Install the driver (the downloaded file must first be made executable, for example with chmod +x):
./NVIDIA-Linux-x86_64-418.43.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.6.1.el7.x86_64/ -k $(uname -r) --dkms -s
Output of a successful installation:
# ./NVIDIA-Linux-x86_64-418.43.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.6.1.el7.x86_64/ -k $(uname -r) --dkms -s
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.43..........
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install
the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
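The install command above hard-codes the kernel release in --kernel-source-path while taking -k from uname -r. A minimal sketch of building the same argument list from a single release string, so the two cannot drift apart. The function name build_installer_command is hypothetical; the flags are exactly the ones used above.

```python
import platform


def build_installer_command(installer, release=None):
    """Build the argument list for the NVIDIA .run installer.

    Uses only the flags shown in this article. `release` defaults to the
    running kernel release (the same string that `uname -r` prints).
    """
    if release is None:
        release = platform.release()
    return [
        installer,
        "--kernel-source-path=/usr/src/kernels/{}/".format(release),
        "-k", release,
        "--dkms",
        "-s",
    ]
```

The resulting list can be passed to subprocess.check_call as root, for example subprocess.check_call(build_installer_command("./NVIDIA-Linux-x86_64-418.43.run")).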
Checking nvidia-smi
The final output:
nvidia-smi
Wed Dec  9 18:04:43 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   35C    P0    37W / 300W |      0MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Persistence mode (Persistence-M) can be turned on as follows:
nvidia-smi --persistence-mode=1
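For scripted checks, the table output of nvidia-smi is awkward to parse. A minimal sketch that uses the --query-gpu and --format=csv,noheader options of nvidia-smi instead. The helper names and the choice of fields are assumptions made for illustration, and the exact formatting of values on this driver version was not verified.

```python
import subprocess

# Fields passed to `nvidia-smi --query-gpu=...`.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader` output into one dict per GPU line."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed GPU list (needs the driver)."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ])
    return parse_gpu_csv(out.decode("utf-8"))
```

On the instance above, query_gpus() should return a single entry whose driver_version is 418.43.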
Installing Python 2.7 and TensorFlow 1.12
pip install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple
pip install --upgrade numpy setuptools -i https://mirrors.aliyun.com/pypi/simple
pip install tensorflow==1.12 -i https://mirrors.aliyun.com/pypi/simple
Installing CUDA 9.0
On https://developer.nvidia.com/cuda-90-download-archive, select the options as prompted.
Choose the network installer and download the package: http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-9.0.176-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-9.0.176-1.x86_64.rpm
sudo yum clean all
Then point the repository at the Aliyun mirror by editing /etc/yum.repos.d/cuda.repo so that it reads:
[cuda]
name=cuda
baseurl=https://mirrors.aliyun.com/nvidia-cuda/rhel7/x86_64/
enabled=1
gpgcheck=1
gpgkey=https://mirrors.aliyun.com/nvidia-cuda/rhel7/x86_64/7fa2af80.pub
Then install CUDA:
sudo yum install cuda-9-0
Confirm that the installation succeeded:
$ /usr/local/cuda-9.0/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
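The version check can also be scripted. A minimal sketch that extracts the release from the nvcc -V text shown above; the function name parse_nvcc_release is hypothetical, and the pattern assumes the "release X.Y, VX.Y.Z" wording of this output.

```python
import re


def parse_nvcc_release(text):
    """Return (major, minor, full_version) from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+)\.(\d+), V([\d.]+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)
```

For the output above this yields (9, 0, "9.0.176"), which can be compared against the CUDA version that the chosen TensorFlow build expects.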
Set the environment variables:
export PATH=$PATH:/usr/local/cuda-9.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64/
Ubuntu 16.04 (xenial)
On Ubuntu 16.04 (xenial), the steps are as follows.
Edit the source list /etc/apt/sources.list.d/cuda.list:
# deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /
deb https://mirrors.aliyun.com/nvidia-cuda/ubuntu1604/x86_64 /
Then refresh the package index and install the package:
apt-get update
apt-get install cuda-9-0
Installing cuDNN 7.1.4
Reference: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html
The downloaded file is cudnn-9.0-linux-x64-v7.1.tgz, 391 MB in size. It contains:
cuda/include/cudnn.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.7
cuda/lib64/libcudnn.so.7.1.4
cuda/lib64/libcudnn_static.a
Install:
$ sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
$ sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
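To confirm which cuDNN version ended up under /usr/local/cuda, the version macros in the header can be read back. A minimal sketch, assuming that cuDNN 7.x defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL in cudnn.h; the function name parse_cudnn_version is hypothetical.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from the text of cudnn.h, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines of the form "#define CUDNN_MAJOR 7".
        pattern = r"^\s*#define\s+" + name + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)
```

A typical call would be parse_cudnn_version(open("/usr/local/cuda/include/cudnn.h").read()), which should report 7.1.4 for the archive used here.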
Installing the GPU Build of TensorFlow
pip install tensorflow-gpu==1.12.0 -i https://mirrors.aliyun.com/pypi/simple
Confirm that the installation succeeded:
import tensorflow as tf
print(tf.test.is_gpu_available())
The output is:
2020-12-12 10:23:17.895072: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-12-12 10:23:19.171864: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-12 10:23:19.172725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:07.0
totalMemory: 15.78GiB freeMemory: 15.48GiB
2020-12-12 10:23:19.172767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-12-12 10:23:19.717858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-12 10:23:19.717910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2020-12-12 10:23:19.717922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2020-12-12 10:23:19.718083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14980 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
True
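Beyond the True/False check, it can help to list the GPU devices TensorFlow actually sees. A minimal sketch using device_lib.list_local_devices() from TensorFlow 1.x; the helper names are hypothetical, and TensorFlow is imported lazily so the filtering logic can be read and tried without it.

```python
def gpu_device_names(devices):
    """Return the names of all entries whose device_type is "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]


def list_tf_gpus():
    """List GPU device names as seen by TensorFlow (needs tensorflow-gpu)."""
    # Imported here so that the module loads even without TensorFlow.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

On the instance above, list_tf_gpus() would be expected to contain /device:GPU:0, matching the device named in the log.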