Prerequisites
Install the GPU driver that matches your operating system and confirm that the GPU is correctly recognized.
GPU container creation flow
containerd -> containerd-shim -> nvidia-container-runtime -> nvidia-container-runtime-hook -> libnvidia-container -> runc -> container-process
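The chain above can be checked on a node by looking up each component's executable. A minimal sketch, assuming the shim is `containerd-shim-runc-v2` (the shim behind `io.containerd.runc.v2`) and that libnvidia-container is represented by its CLI `nvidia-container-cli`; the function name `check_gpu_runtime_chain` is hypothetical:

```shell
# Hypothetical helper: report which executables of the GPU container chain
# are missing from PATH. Returns 0 if all are found, 1 otherwise.
check_gpu_runtime_chain() {
  local bin missing=0
  # Assumptions: containerd-shim-runc-v2 is the shim used by
  # io.containerd.runc.v2, nvidia-container-cli is libnvidia-container's CLI.
  for bin in containerd containerd-shim-runc-v2 nvidia-container-runtime \
             nvidia-container-runtime-hook nvidia-container-cli runc; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$bin"
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]; then
    printf 'all components found\n'
  fi
  return "$missing"
}
```

Running `check_gpu_runtime_chain` prints one `missing: ...` line per absent executable and returns non-zero if anything is missing.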
GPU driver installation
# Ubuntu
apt-get update
apt-get install gcc make
# Install ONE of the drivers below, matching the CUDA version you need
## CUDA 10.1
wget -c https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia/NVIDIA-Linux-x86_64-430.50.run
bash NVIDIA-Linux-x86_64-430.50.run
## cuda10.2
wget -c https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia/NVIDIA-Linux-x86_64-440.100.run
bash NVIDIA-Linux-x86_64-440.100.run
## cuda11
wget -c https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia/NVIDIA-Linux-x86_64-450.66.run
bash NVIDIA-Linux-x86_64-450.66.run
## cuda11.4
wget -c https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia/NVIDIA-Linux-x86_64-470.57.02.run
bash NVIDIA-Linux-x86_64-470.57.02.run
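Since each driver package above pairs with one CUDA version, the choice can be scripted. This is a hypothetical sketch: `driver_for_cuda`, `install_driver` and `verify_driver` are illustrative names, the version table is copied from the list above, and verification relies on `nvidia-smi`, which ships with the driver:

```shell
# Hypothetical helpers built from the list of driver installers above.
DRIVER_BASE_URL="https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia"

# Print the installer file name for a CUDA version; fail for unknown versions.
driver_for_cuda() {
  case "$1" in
    10.1)    echo "NVIDIA-Linux-x86_64-430.50.run" ;;
    10.2)    echo "NVIDIA-Linux-x86_64-440.100.run" ;;
    11|11.0) echo "NVIDIA-Linux-x86_64-450.66.run" ;;
    11.4)    echo "NVIDIA-Linux-x86_64-470.57.02.run" ;;
    *)       echo "unsupported CUDA version: $1" >&2; return 1 ;;
  esac
}

# Download and run the matching installer (needs root, gcc and make).
install_driver() {
  local run_file
  run_file=$(driver_for_cuda "$1") || return 1
  wget -c "$DRIVER_BASE_URL/$run_file" && bash "$run_file"
}

# After installation the driver should list every GPU it recognizes.
verify_driver() {
  nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader
}
```

For example, `install_driver 11.4` downloads and runs the 470.57.02 installer; `verify_driver` should then print one line per recognized GPU.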
Install the NVIDIA container runtime
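Reference: https://nvidia.github.io/nvidia-container-runtime/
# Ubuntu (online installation)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
# Quote EOF so that $(ARCH) is written literally; apt substitutes it itself
cat > /etc/apt/sources.list.d/nvidia-docker.list <<'EOF'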
deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu16.04/$(ARCH) /
EOF
apt-get update
apt-get install nvidia-container-runtime
# CentOS (online installation)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Remove the old NVIDIA repository GPG key so that yum imports the current one
DIST=$(sed -n 's/releasever=//p' /etc/yum.conf)
DIST=${DIST:-$(. /etc/os-release; echo $VERSION_ID)}
sudo rpm -e gpg-pubkey-f796ecb0
sudo gpg --homedir /var/lib/yum/repos/$(uname -m)/$DIST/nvidia-docker/gpgdir --delete-key f796ecb0
sudo yum makecache
yum -y install nvidia-container-runtime
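The Ubuntu and CentOS paths differ only in the package manager and the repository, and `distribution` is simply `$ID$VERSION_ID` from `/etc/os-release`. A hypothetical sketch that derives both (the os-release path is a parameter so it can be tested against a sample file); afterwards `nvidia-container-cli info`, from libnvidia-container, should report the driver and GPUs:

```shell
# Hypothetical helpers; the os-release path defaults to /etc/os-release.

# Same value as: distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
nvidia_distribution() {
  ( . "${1:-/etc/os-release}"; echo "$ID$VERSION_ID" )
}

# Pick the package manager used by the installation steps above.
pkg_manager_for() {
  case "$( . "${1:-/etc/os-release}"; echo "$ID" )" in
    ubuntu) echo "apt-get" ;;
    centos) echo "yum" ;;
    *)      echo "unknown"; return 1 ;;
  esac
}

# Check that libnvidia-container can talk to the installed driver.
verify_container_runtime() {
  nvidia-container-cli info
}
```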
Configure Docker / containerd
# Docker configuration
cat /etc/docker/daemon.json
{
  "registry-mirrors": ["https://wlzfs4t4.mirror.aliyuncs.com"],
  "max-concurrent-downloads": 10,
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "data-root": "/data/var/lib/docker",
  "bip": "169.254.31.1/24",
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
systemctl restart docker
# containerd configuration
cat /etc/containerd/config.toml
# Adjust the other settings to your own needs; only the GPU-related settings are shown here
# Assumes a version = 2 config file (as generated by `containerd config default`)
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      #------------------- modification start -------------------
      default_runtime_name = "nvidia"
      #------------------- modification end ---------------------
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        #------------------- addition start -----------------------
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
        #------------------- addition end -------------------------
systemctl restart containerd.service
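To confirm that both runtimes now default to NVIDIA, `docker info --format '{{.DefaultRuntime}}'` reports Docker's default runtime and `containerd config dump` prints containerd's effective configuration. A minimal sketch; `toml_default_runtime` is a hypothetical `sed`-based helper that assumes `default_runtime_name` appears as a single `key = "value"` line, not a real TOML parser:

```shell
# Print the value of default_runtime_name from a containerd TOML file
# (or from stdin when no file is given). Simplification: first match only.
toml_default_runtime() {
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$@" | head -n 1
}

# Check the live Docker and containerd configuration (needs both daemons).
verify_default_runtimes() {
  local docker_rt containerd_rt
  docker_rt=$(docker info --format '{{.DefaultRuntime}}')
  containerd_rt=$(containerd config dump | toml_default_runtime)
  echo "docker default runtime: $docker_rt"
  echo "containerd default runtime: $containerd_rt"
  [ "$docker_rt" = "nvidia" ] && [ "$containerd_rt" = "nvidia" ]
}
```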
Option 1: use NVIDIA's official device plugin
[Allocate whole GPUs exclusively, by number of cards]
Example: allocating GPU resources in an application YAML
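resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
Here 1 means the container is allocated one whole GPU card. For an extended resource such as nvidia.com/gpu, requests must equal limits (if only limits is set, requests defaults to the same value).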
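A throwaway Pod that requests a GPU and runs `nvidia-smi` is a quick end-to-end check of the request above. A hypothetical sketch: `gpu_test_pod` is an illustrative name, the image is whatever image you have that contains `nvidia-smi` (for example a CUDA base image), and the resource name is a parameter so the same helper can request `aliyun.com/gpu-mem` in Option 2:

```shell
# Hypothetical helper: print a one-off Pod manifest that requests GPU resources.
# Usage: gpu_test_pod NAME IMAGE [RESOURCE] [AMOUNT]
gpu_test_pod() {
  local name=$1 image=$2 resource=${3:-nvidia.com/gpu} amount=${4:-1}
  # Unquoted EOF: the ${...} placeholders below are expanded by the shell
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: ${name}
    image: ${image}
    command: ["nvidia-smi"]
    resources:
      limits:
        ${resource}: ${amount}
      requests:
        ${resource}: ${amount}
EOF
}
```

Once the device plugin is running: `gpu_test_pod gpu-test <cuda-image> | kubectl apply -f -`, then `kubectl logs gpu-test` should show the `nvidia-smi` table. `restartPolicy: Never` keeps the Pod from restarting after `nvidia-smi` exits.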
Enable GPU support in Kubernetes
# cat nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: system-node-critical
      containers:
      - image: ycloudhub.com/middleware/nvidia-gpu-device-plugin:v0.12.3
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
# Apply the manifest and check the result
kubectl apply -f nvidia-device-plugin.yaml
kubectl get po -n kube-system | grep nvidia
kubectl describe nodes ycloud
......
Capacity:
  cpu:                32
  ephemeral-storage:  458291312Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131661096Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                32
  ephemeral-storage:  422361272440
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131558696Ki
  nvidia.com/gpu:     2
  pods:               110
......
Option 2: use a third-party plugin
[Share GPUs, allocating by GPU memory size]
# Alibaba Cloud official Git repository: https://github.com/AliyunContainerService/gpushare-device-plugin/
resources:
  limits:
    aliyun.com/gpu-mem: 3
  requests:
    aliyun.com/gpu-mem: 3
# Here 3 is the amount of GPU memory to use, in GiB
Install the gpushare-scheduler-extender plugin
Reference: https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md
1. Modify the kube-scheduler configuration
# Create /etc/kubernetes/scheduler-policy-config.json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        {
          "name": "aliyun.com/gpu-mem",
          "ignoredByScheduler": false
        }
      ],
      "ignorable": false
    }
  ]
}
# Note: --policy-config-file uses the legacy scheduler Policy API, which was removed in Kubernetes v1.23
# Edit /etc/systemd/system/kube-scheduler.service and add the --policy-config-file flag
cat /etc/systemd/system/kube-scheduler.service
[Unit]
Description=Kubernetes Scheduler
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
[Service]
ExecStart=/usr/local/bin/kube-scheduler \
  --address=127.0.0.1 \
  --master=http://127.0.0.1:8080 \
  --leader-elect=true \
  --v=2 \
  --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
# Restart the service
systemctl daemon-reload
systemctl restart kube-scheduler.service
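The policy file and the unit file have to agree: the policy must name the extender URL and `aliyun.com/gpu-mem`, and `--policy-config-file` must point at the same path. A hypothetical `grep`-based consistency check (it matches the exact spacing used above and is not a JSON validator):

```shell
# Hypothetical check that the scheduler policy and unit file are consistent.
# Usage: check_scheduler_extender POLICY_FILE UNIT_FILE
check_scheduler_extender() {
  local policy=$1 unit=$2 ok=0
  grep -Fq '"urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler"' "$policy" \
    || { echo "policy: extender urlPrefix not found"; ok=1; }
  grep -Fq '"name": "aliyun.com/gpu-mem"' "$policy" \
    || { echo "policy: aliyun.com/gpu-mem is not a managed resource"; ok=1; }
  grep -Fq -- "--policy-config-file=$policy" "$unit" \
    || { echo "unit: --policy-config-file=$policy not set"; ok=1; }
  [ "$ok" -eq 0 ] && echo "scheduler extender config looks consistent"
  return "$ok"
}
```

For example: `check_scheduler_extender /etc/kubernetes/scheduler-policy-config.json /etc/systemd/system/kube-scheduler.service`.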
2. Deploy gpushare-schd-extender
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
kubectl apply -f gpushare-schd-extender.yaml
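The scheduler can only reach the extender on port 32766 once its Pod is running. A sketch that waits with `kubectl rollout status`; `wait_rollout` is a hypothetical name, and the usage line assumes the manifest creates a Deployment called `gpushare-schd-extender` in `kube-system`, so check the downloaded YAML if the names differ:

```shell
# Hypothetical helper: wait for a workload to finish rolling out.
# Usage: wait_rollout NAMESPACE KIND/NAME [TIMEOUT]
wait_rollout() {
  local ns=$1 target=$2 timeout=${3:-120s}
  if kubectl -n "$ns" rollout status "$target" --timeout="$timeout"; then
    echo "$target is ready"
  else
    echo "$target did not become ready within $timeout" >&2
    return 1
  fi
}
```

Usage (names assumed): `wait_rollout kube-system deployment/gpushare-schd-extender`.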
3. Deploy the device plugin
# Label the GPU node with gpushare=true
kubectl label node <target_node> gpushare=true
For example:
kubectl label node mynode gpushare=true
# Deploy the device plugin
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl apply -f device-plugin-rbac.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
kubectl apply -f device-plugin-ds.yaml
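With several GPU nodes the labelling can be looped, and listing nodes by the same label shows where the device-plugin DaemonSet is expected to run (assuming, as the labelling step implies, that it targets `gpushare=true` nodes). `label_gpushare_nodes` is a hypothetical name; `--overwrite` makes re-running it harmless:

```shell
# Hypothetical helper: label each given node for GPU sharing, then list the
# nodes that carry the label.
# Usage: label_gpushare_nodes NODE [NODE...]
label_gpushare_nodes() {
  local node
  for node in "$@"; do
    kubectl label node "$node" gpushare=true --overwrite || return 1
  done
  kubectl get nodes -l gpushare=true -o name
}
```

For example: `label_gpushare_nodes node-1 node-2`.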
4. Install the kubectl-inspect-gpushare plugin to view GPU usage
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
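kubectl treats any executable named `kubectl-<name>` on `PATH` as a plugin, so this file is invoked as `kubectl inspect gpushare`. A sketch with the release version and install directory as parameters (the defaults are taken from the commands above; the function names are hypothetical); `kubectl plugin list` confirms that kubectl can see it:

```shell
# Hypothetical helpers for installing the kubectl-inspect-gpushare plugin.
inspect_gpushare_url() {
  echo "https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/${1:-v0.3.0}/kubectl-inspect-gpushare"
}

# Usage: install_inspect_gpushare [VERSION] [DEST_DIR]
install_inspect_gpushare() {
  local version=${1:-v0.3.0} dest=${2:-/usr/bin}
  wget -O "$dest/kubectl-inspect-gpushare" "$(inspect_gpushare_url "$version")" || return 1
  chmod u+x "$dest/kubectl-inspect-gpushare"
  # kubectl should now list the plugin; then run: kubectl inspect gpushare
  kubectl plugin list
}
```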
The above is for reference only.