Installing the NVIDIA Driver on an Alibaba Cloud ECS GPU Instance

2020-12-13

Background

Both the operating system version and the driver version matter for this setup. This post covers only one tested combination.

First attempt

Installing the system

Use CentOS 7.2. The corresponding Alibaba Cloud image name is:

centos_7_02_64_20G_alibase_20170818.vhd

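I first tried Alibaba Cloud's own image, which did not work out in the end, so I switched to the image above.

After logging in, install and upgrade the relevant packages: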

yum install kernel-devel kernel-doc kernel-headers gcc dkms -y
yum upgrade

After the upgrade (and a reboot, so that the upgraded kernel is the one running), the kernel version is:

# uname -r
3.10.0-1160.6.1.el7.x86_64
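The driver installer compiles a kernel module, so the kernel-devel tree has to match the running kernel. A minimal Python sketch of that check, assuming the CentOS layout `/usr/src/kernels/<release>` that the installer command in this post points at. The function name `kernel_source_path` is illustrative, not part of any tool:

```python
import os
import platform


def kernel_source_path(release=None, base="/usr/src/kernels"):
    """Return the kernel-devel directory expected for a kernel release."""
    # platform.release() returns the same string as `uname -r`.
    release = release or platform.release()
    return os.path.join(base, release)


if __name__ == "__main__":
    path = kernel_source_path()
    if os.path.isdir(path):
        print("%s: found" % path)
    else:
        # Typical causes: kernel-devel is newer than the running kernel
        # (reboot after `yum upgrade`), or kernel-devel is not installed.
        print("%s: missing" % path)
```

If the directory is reported missing, the installer will most likely fail to build the module, so it is worth resolving before moving on.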

Installing the driver

Download the driver:

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/418.43/NVIDIA-Linux-x86_64-418.43.run

Run the installer:

chmod +x NVIDIA-Linux-x86_64-418.43.run
./NVIDIA-Linux-x86_64-418.43.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.6.1.el7.x86_64/ -k $(uname -r) --dkms -s

Output of a successful install:

# ./NVIDIA-Linux-x86_64-418.43.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.6.1.el7.x86_64/ -k $(uname -r) --dkms -s
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.43....................

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install
         the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.

Checking nvidia-smi

The final output:

nvidia-smi
Wed Dec  9 18:04:43 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   35C    P0    37W / 300W |      0MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Persistence mode (Persistence-M) can be turned on as follows:

nvidia-smi --persistence-mode=1
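The table above is meant for reading by eye. For a scripted check, nvidia-smi can also emit CSV through its `--query-gpu` and `--format` options. The following is a rough sketch, not a definitive tool: the helper names `parse_gpu_csv` and `query_gpus` and the choice of fields are assumptions made for illustration.

```python
import subprocess

# Fields requested from nvidia-smi; chosen here only for illustration.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    return parse_gpu_csv(subprocess.check_output(cmd).decode("utf-8"))


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

On the machine in this post, the `driver_version` field should come back as 418.43 if the install succeeded.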

Installing Python 2.7 and TensorFlow 1.12

pip install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple
pip install --upgrade numpy setuptools -i https://mirrors.aliyun.com/pypi/simple


pip install tensorflow==1.12 -i https://mirrors.aliyun.com/pypi/simple

Installing CUDA 9.0

On https://developer.nvidia.com/cuda-90-download-archive, select the options as prompted.

Choose the network installer and download the package: http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-9.0.176-1.x86_64.rpm

sudo rpm -i cuda-repo-rhel7-9.0.176-1.x86_64.rpm
sudo yum clean all

Then point the repository at the Aliyun mirror by editing /etc/yum.repos.d/cuda.repo so that it reads:

[cuda]
name=cuda
baseurl=https://mirrors.aliyun.com/nvidia-cuda/rhel7/x86_64/
enabled=1
gpgcheck=1
gpgkey=https://mirrors.aliyun.com/nvidia-cuda/rhel7/x86_64/7fa2af80.pub

Then install CUDA:

sudo yum install cuda-9-0

Verifying the installation

$ /usr/local/cuda-9.0/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
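To check the toolkit version from a script rather than by eye, the release line of `nvcc -V` can be parsed. This is a small sketch that assumes the output format shown above; `parse_nvcc_release` is a hypothetical helper name.

```python
import re
import subprocess


def parse_nvcc_release(output):
    """Extract (release, build) from `nvcc -V` output, or None if absent."""
    match = re.search(r"release (\d+\.\d+), V(\d+\.\d+\.\d+)", output)
    return match.groups() if match else None


if __name__ == "__main__":
    # Path taken from this post; adjust it if CUDA lives elsewhere.
    nvcc = "/usr/local/cuda-9.0/bin/nvcc"
    text = subprocess.check_output([nvcc, "-V"]).decode("utf-8")
    print(parse_nvcc_release(text))
```

For TensorFlow 1.12 the release component should be 9.0.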

Setting environment variables

export PATH=$PATH:/usr/local/cuda-9.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64/
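The two exports only affect the current shell, so a new session can silently lose them. A minimal sketch of a check, assuming the `/usr/local/cuda-9.0` prefix used in this post; `missing_cuda_paths` is an illustrative name:

```python
import os


def missing_cuda_paths(env, cuda_home="/usr/local/cuda-9.0"):
    """Return the variables in `env` that lack the expected CUDA directory."""
    expected = {
        "PATH": os.path.join(cuda_home, "bin"),
        "LD_LIBRARY_PATH": os.path.join(cuda_home, "lib64"),
    }
    missing = []
    for var, directory in sorted(expected.items()):
        # Drop trailing slashes so "lib64/" and "lib64" compare equal.
        entries = [e.rstrip("/") for e in env.get(var, "").split(os.pathsep)]
        if directory not in entries:
            missing.append(var)
    return missing


if __name__ == "__main__":
    print(missing_cuda_paths(os.environ) or "CUDA paths are set")
```

An empty result means both variables contain the CUDA directories. Anything listed needs to be exported again, for example from the shell's startup file.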

Ubuntu 16.04 (xenial)

On Ubuntu 16.04 (xenial), the steps are as follows.

Download the package: http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb

Change the source address in /etc/apt/sources.list.d/cuda.list:

# deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /
deb https://mirrors.aliyun.com/nvidia-cuda/ubuntu1604/x86_64 /

Then install the package:

apt-get update
apt-get install cuda-9-0

Installing cuDNN 7.1.4

Reference: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html

Reference: https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.1.4/prod/9.0_20180516/cudnn-9.0-linux-x64-v7.1

The downloaded file is cudnn-9.0-linux-x64-v7.1.tgz (391 MB). It contains:

cuda/include/cudnn.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.7
cuda/lib64/libcudnn.so.7.1.4
cuda/lib64/libcudnn_static.a

After extracting the archive, copy the files into place:

$ sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64

$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
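One way to confirm which cuDNN ended up under /usr/local/cuda is to read the version macros from the copied header. The sketch below assumes the cuDNN 7.x layout, where `cudnn.h` itself defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`; `parse_cudnn_version` is a hypothetical helper.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn.h text, or None if not found."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#define\s+%s\s+(\d+)" % macro
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Header location taken from the copy commands in this post.
    with open("/usr/local/cuda/include/cudnn.h") as header:
        print(parse_cudnn_version(header.read()))
```

For the archive used here, the three numbers should correspond to 7.1.4.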

Installing the GPU build of TensorFlow

pip install tensorflow-gpu==1.12.0 -i https://mirrors.aliyun.com/pypi/simple

Verifying that the GPU is visible

import tensorflow as tf
print(tf.test.is_gpu_available())

The output:

2020-12-12 10:23:17.895072: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-12-12 10:23:19.171864: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-12 10:23:19.172725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:07.0
totalMemory: 15.78GiB freeMemory: 15.48GiB
2020-12-12 10:23:19.172767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-12-12 10:23:19.717858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-12 10:23:19.717910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2020-12-12 10:23:19.717922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2020-12-12 10:23:19.718083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14980 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
True
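If this check prints False instead, one common cause is that the dynamic loader cannot find the CUDA or cuDNN shared libraries. The sketch below tries to load them directly with `ctypes`. The library names are an assumption: they are the sonames that a TensorFlow 1.12 build against CUDA 9.0 and cuDNN 7 would be expected to need, not something taken from the original post.

```python
import ctypes

# Assumed sonames for CUDA 9.0 and cuDNN 7; adjust for other versions.
REQUIRED = ["libcudart.so.9.0", "libcublas.so.9.0", "libcudnn.so.7"]


def can_load(library_name):
    """Return True if the dynamic loader can open the named library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    for name in REQUIRED:
        print("%-20s %s" % (name, "ok" if can_load(name) else "NOT FOUND"))
```

A library reported as NOT FOUND usually points back to `LD_LIBRARY_PATH` or to the cuDNN copy step.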

References

Installing the NVIDIA GPU driver, CUDA 8.0, and cuDNN 5.1 on CentOS 7.4