Simplifying Large Model Deployment with the nvidia/cuda Image

Published 2023-11-03


Install docker and docker-compose

Ubuntu:

Reference article

sudo apt install apt-transport-https ca-certificates curl software-properties-common gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update -y
sudo apt-get install docker-ce docker-ce-cli docker-compose -y

CentOS:

sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum clean all
sudo yum makecache
sudo yum install docker-ce docker-ce-cli docker-compose -y
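
After either install path above, it can help to confirm that both tools actually landed on PATH before continuing. The following is a minimal sketch; the helper names `have_cmd` and `report_tools` are hypothetical and not part of docker.

```shell
#!/bin/sh
# Return success if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one "name: found" or "name: missing" line per tool.
report_tools() {
    for tool in "$@"; do
        if have_cmd "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

report_tools docker docker-compose
```

If either tool is reported as missing, the troubleshooting notes in this post are a reasonable next stop.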

Troubleshooting

  • On Ubuntu, the lsb_release command used in the steps above may fail with ModuleNotFoundError: No module named 'lsb_release'. In that case, copy the system's complete lsb_release.py file into the directory that raises the error:
    sudo cp /usr/lib/python3/dist-packages/lsb_release.py /usr/bin/
  • If docker-compose does not work after installation, download the latest version from its GitHub releases page and place it in /usr/bin.
  • If docker-compose up -d fails with error getting credentials - err: exit status GDBus.Error:org.freedesktop.DBus.Error.ServiceU, install these packages: sudo apt install gnupg2 pass
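
The manual docker-compose download mentioned above could be sketched roughly as follows. This assumes the Compose v2 release asset naming `docker-compose-linux-<arch>` and GitHub's `releases/latest/download` redirect. The helper `compose_url` is hypothetical. The actual download is left commented out because it needs network access and sudo.

```shell
#!/bin/sh
# Build the download URL for the newest docker-compose release asset.
# Assumption: Compose v2 assets are named "docker-compose-linux-<arch>".
compose_url() {
    echo "https://github.com/docker/compose/releases/latest/download/docker-compose-linux-$1"
}

# uname -m prints the machine architecture, e.g. x86_64 or aarch64.
url=$(compose_url "$(uname -m)")
echo "$url"

# To install, uncomment the two lines below:
# sudo curl -fL "$url" -o /usr/bin/docker-compose
# sudo chmod +x /usr/bin/docker-compose
```

Printing the URL first makes it easy to check the asset name against the releases page before overwriting anything in /usr/bin.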

Adjust user permissions

sudo groupadd docker
sudo gpasswd -a your_username docker
sudo service docker restart

Then remember to reconnect over SSH or restart your terminal so the new group membership takes effect.
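
To check whether the group change has been recorded for the current user, something like the sketch below may be useful. The helper `in_group` is hypothetical. It only inspects the group list reported by `id`, so it does not prove that the docker daemon itself is reachable.

```shell
#!/bin/sh
# Return success if user $1 belongs to group $2 according to `id`.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

user=$(id -un)
if in_group "$user" docker; then
    echo "$user is in the docker group"
else
    echo "$user is not in the docker group yet"
fi
```

A shell that was opened before the change can still lack the permission, which is why reconnecting is needed.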

Install the NVIDIA Container Toolkit

See the official guide.

Ubuntu:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

CentOS:

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit

After installation, configure Docker to use the NVIDIA container runtime:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
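
One way to confirm that the configure step took effect is to look for an "nvidia" entry in Docker's daemon configuration. This sketch assumes the default config path /etc/docker/daemon.json. The helper `has_nvidia_runtime` is hypothetical and does a plain text match, not real JSON parsing.

```shell
#!/bin/sh
# Return success if the given daemon.json mentions an "nvidia" runtime entry.
has_nvidia_runtime() {
    [ -f "$1" ] && grep -q '"nvidia"' "$1"
}

# Assumed default location of Docker's daemon configuration.
config=/etc/docker/daemon.json
if has_nvidia_runtime "$config"; then
    echo "nvidia runtime is registered in $config"
else
    echo "nvidia runtime not found in $config"
fi
```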

Finally, verify the setup:

docker run --gpus all nvidia/cuda:11.7.1-devel-ubuntu20.04 nvidia-smi

If nvidia-smi prints its usual GPU table, the setup works.

Write the Dockerfile

...

!!!Note!!!

I recommend using the pytorch image directly... there is no need to configure conda. If you hit an error, try restarting first...

Last updated 2023-11-04