Instruction

A hands-on guide to installing and configuring TensorFlow, PyTorch, CUDA, NVIDIA-SMI, cuDNN, Ubuntu, NGC, Docker, and NVIDIA-Docker

Cloud pricing across providers (TWCC, GCP, Azure, AWS)
[Repost] How to choose a cloud platform? Comparing the three major cloud providers: GCP, AWS, and Azure

https://github.com/Deep-Learning-101
https://huggingface.co/DeepLearning101

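2016/06: bought it myself (unboxing): GIGABYTE GTX 960 4G * 2

2017/01: bought it myself (unboxing): GIGABYTE GTX 1080 XTREME GAMING 8G

2018/05: company purchase (unboxing): NVIDIA TITAN V + NVIDIA TITAN XP

2023/08: company purchase of new hardware: 6000 Ada 48 GB * 2 and A100 80GB * 4

Text in red was updated on 2023/08/11.

[ Illustrated version ]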


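Start with the wiki page https://en.wikipedia.org/wiki/CUDA#GPUs_supported, which clearly lists the micro-architecture of each GPU. If you really intend to run deep learning, I recommend treating Pascal as the minimum baseline. Before installing, remember to switch to tty2 and run sudo service lightdm stop to stop the display manager. Then install the dependencies: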

sudo apt install dkms build-essential linux-headers-generic
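Then, sitting at the machine itself, press Ctrl + Alt + F2 to switch to tty2. I recommend installing the driver directly at the local console; otherwise you may run into some odd errors. Next, I suggest first removing the driver that shipped with the OS: sudo apt-get remove --purge 'nvidia*', or list the packages with dpkg -l 'nvidia*' and then run dpkg --remove nvidia-(XXX). If you already know the preinstalled version, do it like this: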

sudo apt-get remove --purge nvidia-driver-535 

sudo apt-get autoremove

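Once you have confirmed the GPU model, pick a driver to download from the page below. I strongly recommend reading this whole post and thinking it through before choosing a version: https://www.nvidia.com.tw/Download/index.aspx?lang=tw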
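To make the "Pascal as the baseline" rule concrete, here is a minimal sketch that maps a CUDA compute capability to its architecture name. The starting capabilities follow the Wikipedia table linked above; the function names and the list of cards are only illustrative and are not part of any NVIDIA tool.

```python
# Lowest compute capability (major, minor) at which each architecture starts.
# Ordered from newest to oldest so the first match wins.
ARCHITECTURES = [
    ((9, 0), "Hopper"),
    ((8, 9), "Ada Lovelace"),
    ((8, 0), "Ampere"),
    ((7, 5), "Turing"),
    ((7, 0), "Volta"),
    ((6, 0), "Pascal"),
    ((5, 0), "Maxwell"),
    ((3, 0), "Kepler"),
    ((2, 0), "Fermi"),
    ((1, 0), "Tesla"),
]

PASCAL = (6, 0)


def architecture_name(capability):
    """Return the architecture name for a (major, minor) compute capability."""
    if capability[0] > 9:
        # Anything after Hopper is outside this small table.
        return "newer than this table"
    for minimum, name in ARCHITECTURES:
        if capability >= minimum:
            return name
    return "unknown"


def meets_pascal_baseline(capability):
    """True when the card is Pascal (6.0) or newer."""
    return capability >= PASCAL


# Cards mentioned in this post, with capabilities taken from the same wiki table.
for card, capability in [
    ("GTX 960", (5, 2)),
    ("GTX 1080", (6, 1)),
    ("TITAN V", (7, 0)),
    ("A100", (8, 0)),
    ("6000 Ada", (8, 9)),
]:
    print(card, architecture_name(capability), meets_pascal_baseline(capability))
```

By this rule the GTX 960 (Maxwell) falls below the baseline, while every later card in the list passes.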

For example, this is how I downloaded the version I was installing at the time:

wget http://tw.download.nvidia.com/XFree86/Linux-x86_64/440.82/NVIDIA-Linux-x86_64-440.82.run

After downloading, remember to add execute permission with chmod a+x, then run:

./NVIDIA-Linux-x86_64-440.82.run -no-x-check -no-nouveau-check -no-opengl-files

wget https://tw.download.nvidia.com/XFree86/Linux-x86_64/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run
chmod a+x NVIDIA-Linux-x86_64-535.104.05.run
sudo ./NVIDIA-Linux-x86_64-535.104.05.run

Normally the steps above install successfully. But, as always, there is that one BUT ... where the installation fails:

(1.) NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

https://blog.csdn.net/wjinjie/article/details/108997692

https://andy51002000.blogspot.com/2019/01/nvidia-smi-has-failed-because-it.html

(2.) [nvidia-smi] Failed to initialize NVML: Driver/library version mismatch

https://www.cnblogs.com/duby0/p/17060960.html

https://zhuanlan.zhihu.com/p/94378201

(3.) Installing the NVIDIA driver reports: An NVIDIA kernel module 'nvidia-drm'
https://zhuanlan.zhihu.com/p/135875408
(4.) [Linux] Tutorial: installing and removing the NVIDIA graphics driver on Ubuntu
https://blog.csdn.net/ksws0292756/article/details/79160742
https://zhuanlan.zhihu.com/p/31575356
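For error (2.), the usual cause is that the loaded kernel module and the user-space NVML library come from different driver versions. Here is a minimal sketch for checking that, assuming an Ubuntu x86_64 layout where the library sits at /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.<version>. The helper names are hypothetical, and the sample text in the usage note only illustrates the format of /proc/driver/nvidia/version.

```python
import glob
import re
from pathlib import Path

# Driver versions look like 535.104.05 or 440.82.
VERSION_RE = re.compile(r"\b(\d{3}\.\d+(?:\.\d+)?)\b")


def kernel_module_version(proc_text):
    """Extract the driver version from the text of /proc/driver/nvidia/version."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version:"):
            match = VERSION_RE.search(line)
            return match.group(1) if match else None
    return None


def library_versions(pattern="/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*"):
    """Collect versions from libnvidia-ml.so.<version> file names (path is an assumption)."""
    versions = set()
    for path in glob.glob(pattern):
        suffix = Path(path).name.split(".so.", 1)[1]
        # Skip the libnvidia-ml.so.1 symlink, which carries no driver version.
        if VERSION_RE.fullmatch(suffix):
            versions.add(suffix)
    return versions


def is_mismatch(kernel_version, lib_versions):
    """True when both sides are known and the kernel module version is not among the libraries."""
    return (
        kernel_version is not None
        and bool(lib_versions)
        and kernel_version not in lib_versions
    )


proc_file = Path("/proc/driver/nvidia/version")
if proc_file.exists():
    kernel = kernel_module_version(proc_file.read_text())
    libs = library_versions()
    print("kernel module:", kernel, "libraries:", sorted(libs))
    print("mismatch:", is_mismatch(kernel, libs))
```

If it reports a mismatch, rebooting (or unloading and reloading the nvidia kernel modules) after the driver upgrade is what the linked articles suggest.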
In the end, this time I installed the driver with apt-get:

sudo add-apt-repository ppa:graphics-drivers/ppa

sudo apt-get update

apt-cache search nvidia-driver-

sudo apt-get install nvidia-driver-535
sudo reboot
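After the reboot it is worth confirming that the driver really loaded. The sketch below calls nvidia-smi with its --query-gpu and --format=csv,noheader options and parses the result. The function names are hypothetical, and it simply returns an empty list when nvidia-smi is missing or fails.

```python
import csv
import io
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_rows(csv_text):
    """Turn the CSV output of the query above into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) == 3:
            name, driver, memory = (field.strip() for field in fields)
            rows.append(
                {"name": name, "driver_version": driver, "memory_total": memory}
            )
    return rows


def query_gpus():
    """Run nvidia-smi; return [] when it is not installed or reports an error."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


for gpu in query_gpus():
    print(gpu["name"], gpu["driver_version"], gpu["memory_total"])
```

An empty result right after installation usually points back to one of the errors listed above.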

Older example (CUDA 10.0):
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
sudo sh cuda_10.0.130_410.48_linux

Updated (CUDA 12.2.2):
wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run
sudo sh cuda_12.2.2_535.104.05_linux.run
sudo apt install nvidia-cuda-toolkit

/Developer/NVIDIA/CUDA-#.#
Do you accept the previously read EULA?
accept/decline/quit: accept
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48? (choose not to install the driver bundled with CUDA)
(y)es/(n)o/(q)uit: n
Install the CUDA 10.0 Toolkit? (install CUDA)
(y)es/(n)o/(q)uit: y
Enter Toolkit Location
[ default is /usr/local/cuda-10.0 ]: (use the default CUDA install path)
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 10.0 Samples? (do not install the samples)
(y)es/(n)o/(q)uit: n
Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-10.0
Samples:  Not Selected
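Once the toolkit is installed, a quick way to confirm which release is active is to read the output of nvcc --version. This is a minimal sketch, assuming the /usr/local/cuda symbolic link created by the installer prompt above; the function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path


def parse_nvcc_release(output):
    """Pull (major, minor) out of the 'release X.Y' part of nvcc --version output."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def installed_cuda_release(nvcc="/usr/local/cuda/bin/nvcc"):
    """Run nvcc from the symlinked toolkit; return None when it is not there."""
    if not Path(nvcc).exists():
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


print("CUDA release:", installed_cuda_release())
```

Calling nvcc by its full path avoids depending on PATH, which the runfile installer does not modify for you.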

cd /usr/local/cuda-12.2/
sudo cp ~/cudnn-linux-x86_64-8.9.4.25_cuda12-archive/include/* /usr/local/cuda-12.2/include/
sudo cp -P ~/cudnn-linux-x86_64-8.9.4.25_cuda12-archive/lib/* /usr/local/cuda-12.2/lib64/
sudo chmod a+r /usr/local/cuda-12.2/include/cudnn*.h /usr/local/cuda-12.2/lib64/libcudnn*
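To check that the cuDNN headers landed where expected, you can read the version macros from cudnn_version.h, the header where cuDNN 8 defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL. The sketch below assumes the /usr/local/cuda-12.2 path used in the copy commands above; the function names are hypothetical.

```python
import re
from pathlib import Path


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from the #define lines, or None if any is missing."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^#define\s+{key}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def installed_cudnn_version(header="/usr/local/cuda-12.2/include/cudnn_version.h"):
    """Read the header copied in the step above; None when the file is absent."""
    path = Path(header)
    if not path.exists():
        return None
    return parse_cudnn_version(path.read_text())


print("cuDNN version:", installed_cudnn_version())
```

For the archive used here, the expected result would be (8, 9, 4), matching the 8.9.4 in the file name.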

sudo apt-get install curl 

curl -sSL https://get.docker.com | sudo sh


ps faux | grep -i docker

sudo docker info
sudo systemctl stop docker

sudo rsync -avh /var/lib/docker/ /path/to/your/docker/


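sudo vi /etc/docker/daemon.json
Add the following entry inside the JSON object (JSON does not allow comments, so add only the entry itself):
"data-root": "/path/to/your/docker"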

sudo systemctl daemon-reload


sudo systemctl start docker

sudo docker info

ps faux | grep -i docker
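Hand-editing daemon.json is where this step most often breaks, because a stray comma or comment makes the file invalid JSON and Docker then refuses to start. Here is a minimal sketch that merges the data-root entry into whatever the file already contains. The function name is hypothetical, and writing the result back to /etc/docker/daemon.json (as root) is left to you.

```python
import json


def with_data_root(daemon_json_text, data_root):
    """Return daemon.json content with "data-root" set, keeping all other keys."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config["data-root"] = data_root
    return json.dumps(config, indent=4) + "\n"


# Example with a configuration that already has another key.
print(with_data_root('{"log-driver": "json-file"}', "/path/to/your/docker"))
```

Because the text goes through json.loads first, an already broken file fails loudly here instead of at the next docker restart.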

Next, install nvidia-docker2 (https://github.com/NVIDIA/nvidia-docker):

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - 

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) 

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update 

sudo apt-get install nvidia-docker2

Install nvidia-container-runtime: https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup

sudo apt-get install nvidia-container-runtime

sudo service docker restart

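At this point you can grab an image from NGC and work inside Docker. I don't really recommend developing directly on the host, because it is very easy to wreck the environment XD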
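Before pulling a large image, a quick smoke test is to run nvidia-smi inside a small CUDA container. The sketch below only builds the command line. The --gpus flag is Docker's own option (19.03 and later), while the image tag is an assumption, so substitute any CUDA base image you actually have access to.

```python
import shlex


def gpu_smoke_test_command(image, gpus="all"):
    """Build a docker command that runs nvidia-smi in a throwaway container."""
    return ["sudo", "docker", "run", "--rm", "--gpus", gpus, image, "nvidia-smi"]


# The image tag below is only an example and may need adjusting.
command = gpu_smoke_test_command("nvidia/cuda:12.2.2-base-ubuntu22.04")
print(shlex.join(command))
```

If the container prints the same GPU table as the host, the NVIDIA runtime is wired up correctly.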

For this one you need to download from the official Anaconda site: https://www.anaconda.com/distribution/

wget https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Linux-x86_64.sh
chmod a+x Anaconda3-2023.07-2-Linux-x86_64.sh
./Anaconda3-2023.07-2-Linux-x86_64.sh 

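Once the environment is set up on one machine, you can export and import it like this to avoid configuring the same packages over and over:
Export: conda env export --name yourenv --file yourenv.yml
Import: conda env create -f yourenv.yml
Check: conda info --envs
List environments: conda env list
Clone an environment: conda create -n myEnvNameDes --clone myEnvNameSou
Remove an environment: conda env remove --name myEnvName
Remove a package from an environment: conda remove --name myEnvName pandas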
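When scripting the import step across several machines, it helps to check whether the environment already exists before running conda env create. This is a minimal sketch that parses the output of conda env list --json, which reports environment paths under an "envs" key. The function name and the sample paths are hypothetical.

```python
import json


def env_names(conda_json_text):
    """Map environment directory names to their paths.

    Note that the base install shows up under its directory name
    (for example "anaconda3"), not as "base".
    """
    envs = json.loads(conda_json_text).get("envs", [])
    return {path.rstrip("/").rsplit("/", 1)[-1]: path for path in envs}


# Sample text in the shape printed by `conda env list --json`.
sample = '{"envs": ["/home/user/anaconda3", "/home/user/anaconda3/envs/yourenv"]}'
names = env_names(sample)
print("yourenv" in names, names["yourenv"])
```

On a real machine you would feed it the captured output of the conda command instead of the sample string.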

Of course, you can just use Docker, which is even more convenient:

sudo nvidia-docker pull paddlecloud/paddleocr:2.6-gpu-cuda11.2-cudnn8-latest

conda install tensorflow==1.15.0

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

To verify the versions and so on, use commands like these:

python -c "import torch; print(torch.__version__)" 

1.3.0 

python3 -c "import torch; print(torch.version.cuda)"

10.1.243

python3 -c "import torch; print(torch.backends.cudnn.version())"

7603

python -c "import torch; print(torch.cuda.is_available())" 

True
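The four one-liners above can be folded into a single script. The sketch below uses only the same torch attributes shown above plus torch.cuda.device_count(), and it decodes the cuDNN number (7603 above) into major, minor and patch, assuming the pre-cuDNN-9 encoding of major * 1000 + minor * 100 + patch. The function names are hypothetical.

```python
def decode_cudnn_version(number):
    """Split e.g. 7603 into (7, 6, 3); assumes the encoding used before cuDNN 9."""
    return (number // 1000, (number % 1000) // 100, number % 100)


def torch_report():
    """Collect the same facts as the one-liners; {"torch": None} if torch is absent."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    cudnn = torch.backends.cudnn.version()  # None when cuDNN is unavailable
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": decode_cudnn_version(cudnn) if cudnn is not None else None,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


print(torch_report())
```

A cuda_available of False together with a non-empty cuda field usually means the driver, rather than the PyTorch build, is the problem.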


PyTorch DataParallel vs. DistributedDataParallel
(Distributed training in PyTorch: single machine, multiple GPUs)

DataParallel


DistributedDataParallel (PyTorch distributed data parallel: DistributedDataParallel, DDP)