Configuring passthrough of an Nvidia GeForce card to a KVM virtual machine differs slightly from passing through an Nvidia Tesla compute GPU.
Nvidia's driver checks whether a GeForce card is running inside a virtual machine and refuses to work there. So after a GeForce card is passed through directly, the driver detects the hypervisor and stops working (on Windows this shows up as driver error code 43 after the display driver is installed). We therefore need to hide KVM from the guest to fool the driver.
Environment
Host OS: CentOS Linux release 7.6.1810 (Core)
Kernel: 3.10.0-957.12.2.el7.x86_64
GPUs: GeForce RTX 2080 Ti × 8
qemu-kvm-ev version: qemu-kvm-ev-2.12.0-18.el7_6.5.1
Passthrough mode: iommu + vfio_pci
CPU: Intel(R)
Guest OS: Windows 10 Enterprise LTSC 64-bit (build 17763)
KVM environment
Enable VT-d
Enable VT-d support in the BIOS; the exact steps vary slightly between server models.
Install the KVM environment
yum -y install qemu-kvm libvirt virt-install
To hide KVM from the guest, the following needs to be added to the domain XML:
<kvm>
<hidden state='on'/>
</kvm>
# This option is not supported by the stock qemu-kvm from the default yum repos; install qemu-kvm-ev
yum -y install centos-release-qemu-ev
yum install qemu-kvm-ev.x86_64
# Configure the VNC listen address and VNC password for qemu in /etc/libvirt/qemu.conf
vnc_listen = "0.0.0.0"
vnc_password = "password"
# Start the service
systemctl start libvirtd
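To confirm that the EV build is the one actually in use (the emulator path below matches the one used later in the domain XML):
# Print the qemu-kvm version; it should report the -ev build
/usr/libexec/qemu-kvm --version
# Start libvirtd on every boot
systemctl enable libvirtd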
Enable IOMMU support in the kernel
The hardest part of implementing user-space device drivers is exposing DMA to user space in a safe, controlled way:
A DMA-capable device can usually write to any page of memory, so giving user space the ability to program DMA is equivalent to handing it root privileges; a malicious device could exploit this to mount a DMA attack.
The introduction of the I/O memory management unit (IOMMU) constrains devices: device I/O addresses are remapped by the IOMMU to physical memory addresses. A malicious or buggy device cannot read or write memory that has not been explicitly mapped for it; the operating system running on the CPU manages the MMU and IOMMU exclusively, so physical devices cannot bypass or corrupt the configurable memory-management tables.
The smallest unit the IOMMU can assign is a group.
Edit /etc/default/grub and append the following option to the GRUB_CMDLINE_LINUX line to enable kernel IOMMU support:
intel_iommu=on
# Back up the current grub config
cp /boot/grub2/grub.cfg ~/
# Regenerate grub.cfg
grub2-mkconfig -o /boot/grub2/grub.cfg
# Reboot the system
reboot
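A quick sanity check after the reboot (the exact dmesg wording varies between kernel versions):
# The boot command line should now carry the flag
cat /proc/cmdline
# and the kernel log should show DMAR/IOMMU initialization
dmesg | grep -i -e dmar -e iommu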
Load the vfio_pci kernel module
VFIO is a framework that safely exposes device I/O, interrupts, DMA, and so on to user space, so that device drivers can be implemented there. With direct device access from user space, virtual machine devices can achieve higher I/O performance.
# Load the vfio_pci module
modprobe vfio_pci
# Load vfio_pci at boot
echo vfio_pci > /etc/modules-load.d/vfio_pci.conf
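Confirm the module is present before going further:
# vfio_pci should appear in the loaded-module list
lsmod | grep vfio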
GPU devices
# List all Nvidia devices
lspci -nn | grep NVIDIA
The output looks like:
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
04:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
04:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
05:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
05:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
05:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
08:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
08:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
08:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
09:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
09:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
09:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
85:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
85:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
85:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
85:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
86:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
86:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
86:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
86:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
89:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
89:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
89:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
8a:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
8a:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
8a:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
8a:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
The output above shows 8 cards, and each card exposes 4 functions.
The smallest unit the IOMMU can assign is a group. All IOMMU groups live under /sys/kernel/iommu_groups/.
iommu group
# Find out which IOMMU group the GPU belongs to
find /sys/kernel/iommu_groups/ -type l | grep 0000:04:00.0
Output:
/sys/kernel/iommu_groups/44/devices/0000:04:00.0
The device belongs to IOMMU group 44.
# List the devices in that group
ll /sys/kernel/iommu_groups/44/devices/
Output:
lrwxrwxrwx 1 root root 0 Jun 11 05:08 0000:04:00.0 -> ../../../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:08.0/0000:04:00.0
lrwxrwxrwx 1 root root 0 Jun 11 05:08 0000:04:00.1 -> ../../../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:08.0/0000:04:00.1
lrwxrwxrwx 1 root root 0 Jun 11 05:08 0000:04:00.2 -> ../../../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:08.0/0000:04:00.2
lrwxrwxrwx 1 root root 0 Jun 11 05:08 0000:04:00.3 -> ../../../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:08.0/0000:04:00.3
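To survey all groups at once instead of querying one device at a time, a small loop over sysfs works; this is a generic sketch, not specific to this host:
# Print every IOMMU group and the devices it contains
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns ${d##*/})"
    done
done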
# The GPU's IOMMU group can also be inspected with virsh
virsh nodedev-dumpxml pci_0000_04_00_0
Output:
<device>
<name>pci_0000_04_00_0</name>
<path>/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:08.0/0000:04:00.0</path>
<parent>pci_0000_03_08_0</parent>
<driver>
<name>vfio-pci</name>
</driver>
<capability type='pci'>
<domain>0</domain>
<bus>4</bus>
<slot>0</slot>
<function>0</function>
<product id='0x1e04' />
<vendor id='0x10de'>NVIDIA Corporation</vendor>
<iommuGroup number='44'>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x2'/>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x3'/>
</iommuGroup>
<numa node='0'/>
<pci-express>
<link validity='cap' port='8' speed='8' width='16'/>
<link validity='sta' speed='8' width='16'/>
</pci-express>
</capability>
</device>
The output above shows that all 4 functions of the card were assigned to a single IOMMU group, and that the device sits on NUMA node 0.
Therefore all 4 functions of the card must be passed through together; passing through only the VGA function is not enough.
Each VFIO device node corresponds to an IOMMU group number, under the directory /dev/vfio/.
For example, device pci_0000_04_00_0 is in IOMMU group 44, so the corresponding device node is /dev/vfio/44.
ll /dev/vfio/44
crw------- 1 root root 245, 4 Jun 10 21:33 /dev/vfio/44
Add the GPU to the virtual machine XML
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
</source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x2'/>
</source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x3'/>
</source>
</hostdev>
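Because managed='yes' is set, libvirt unbinds the functions from their host driver and attaches them to vfio-pci automatically when the VM starts. To detach them ahead of time instead, virsh can do it per function (addresses taken from this host):
# Detach all 4 functions of the card at 04:00 from the host
for f in 0 1 2 3; do
    virsh nodedev-detach pci_0000_04_00_${f}
done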
Start the virtual machine, connect to it over VNC, and install the graphics driver.
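Assuming the domain XML has been saved to a file (the file name here is illustrative), defining and starting the guest looks like:
# Register the domain with libvirt, then boot it
virsh define win10-gpu-01.xml
virsh start win10-gpu-01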
# Check which kernel driver the card is currently using
lspci -s 04:00.0 -v | grep driver
The output is:
Kernel driver in use: vfio-pci
This shows that the card is driven by the vfio-pci kernel module.
Multi-GPU passthrough
With 8 cards in the host and 4 passed through to each virtual machine, 1 or 2 of the GPU drivers may fail to load in the guest. The fix is to set the guest-side address manually in each hostdev entry. Normally this is unnecessary, because libvirt generates guest addresses automatically; here, however, the auto-generated layout treats every passed-through function as an independent device. Instead, configure each physical card as a single guest PCI device with 4 functions, as in the snippet below:
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x85' slot='0x00' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x85' slot='0x00' function='0x1'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x1'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x85' slot='0x00' function='0x2'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x2'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x85' slot='0x00' function='0x3'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x3'/>
</hostdev>
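Note that multifunction='on' appears only on function 0: it marks the guest slot as a multifunction device, so the guest enumerates functions 1-3 behind it. Once the domain is defined, the manual guest addresses can be double-checked (the grep pattern is only illustrative):
# All four functions should land in the same guest slot
virsh dumpxml win10-gpu-01 | grep "slot='0x06'"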
NUMA architecture
NUMA groups the physical logical CPUs and the memory into nodes. Each node has its own CPU bus and memory; if memory is accessed across nodes, a CPU in one node ends up reading memory that belongs to another node, which increases memory access latency.
# Install numactl
yum -y install numactl
# Show the NUMA topology
numactl --hardware
Output:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 130947 MB
node 0 free: 1417 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 131072 MB
node 1 free: 68 MB
node distances:
node 0 1
0: 10 21
1: 21 10
The host's NUMA topology splits the CPUs and memory into 2 nodes.
Pin vCPUs to physical logical CPUs
Normally every vCPU of a KVM virtual machine can be scheduled onto any physical logical CPU. If a virtual machine's vCPUs span different nodes, a CPU in one node will access memory in another node, which again increases memory access latency. To avoid this, pin the virtual machine's vCPUs to logical CPUs of the same node.
# Show how the VM's vCPUs are scheduled onto host logical cores
virsh vcpuinfo win10-gpu-01
# Show which physical logical CPUs the VM may use
virsh emulatorpin win10-gpu-01
# Pin or change, on a running VM, which physical logical CPUs it may use
virsh emulatorpin win10-gpu-01 0-13 --live
# Pin a vCPU 1:1 to a physical logical CPU: virsh vcpupin <domain> <vcpu> <lcpu>
virsh vcpupin win10-gpu-01 0 0
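Pinning all of a guest's vCPUs 1:1 can be scripted; this sketch matches the 14-vCPU layout used in the complete example below:
# Pin vCPU n to host logical CPU n, both live and in the persistent config
for n in $(seq 0 13); do
    virsh vcpupin win10-gpu-01 ${n} ${n} --live --config
done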
CPU pinning can also be configured in the domain XML:
<cputune>
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<vcpupin vcpu='4' cpuset='4'/>
<vcpupin vcpu='5' cpuset='5'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='7'/>
<vcpupin vcpu='8' cpuset='8'/>
<vcpupin vcpu='9' cpuset='9'/>
<vcpupin vcpu='10' cpuset='10'/>
<vcpupin vcpu='11' cpuset='11'/>
<vcpupin vcpu='12' cpuset='12'/>
<vcpupin vcpu='13' cpuset='13'/>
</cputune>
A device's NUMA node
Devices are also associated with NUMA nodes. When we inspected the card's IOMMU group above, the output showed that the GPU's NUMA node is 0.
A device's NUMA node can also be read from sysfs: /sys/bus/pci/devices/<device address>/numa_node
# Check which NUMA node GPU 04:00.0 belongs to
cat /sys/bus/pci/devices/0000\:04\:00.0/numa_node
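To map every card to its node in one pass (the device list is taken from the lspci output above):
# Print the NUMA node of each GPU's VGA function
for dev in 04:00.0 05:00.0 08:00.0 09:00.0 85:00.0 86:00.0 89:00.0 8a:00.0; do
    echo "${dev}: node $(cat /sys/bus/pci/devices/0000:${dev}/numa_node)"
done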
When doing passthrough, pass devices that belong to the same NUMA node through to the same virtual machine.
A complete XML example
<domain type='kvm'>
<name>win10-gpu-01</name>
<uuid>b0eb51c0-93d4-a1c7-e5df-457f1bc8c30c</uuid>
<memory unit='MiB'>65536</memory>
<vcpu placement='static'>14</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<vcpupin vcpu='4' cpuset='4'/>
<vcpupin vcpu='5' cpuset='5'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='7'/>
<vcpupin vcpu='8' cpuset='8'/>
<vcpupin vcpu='9' cpuset='9'/>
<vcpupin vcpu='10' cpuset='10'/>
<vcpupin vcpu='11' cpuset='11'/>
<vcpupin vcpu='12' cpuset='12'/>
<vcpupin vcpu='13' cpuset='13'/>
</cputune>
<os>
<type arch='x86_64' machine='pc'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<hyperv>
<relaxed state='off'/>
<vapic state='off'/>
<spinlocks state='off'/>
</hyperv>
<kvm>
<hidden state='on'/>
</kvm>
</features>
<cpu mode='host-model'>
<model fallback='allow'/>
<topology sockets='1' cores='14' threads='1'/>
</cpu>
<clock offset='localtime'>
<timer name='hypervclock' present='no'/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native'/>
<source file='/data/kvm/vms/win10-gpu-01/win10-gpu-01.vda.qcow2'/>
<target dev='vda' bus='virtio'/>
</disk>
<controller type='usb' index='0' model='ich9-ehci1'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x7'/>
</controller>
<controller type='usb' index='0' model='ich9-uhci1'>
<master startport='0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
</controller>
<controller type='usb' index='0' model='ich9-uhci2'>
<master startport='2'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
</controller>
<controller type='usb' index='0' model='ich9-uhci3'>
<master startport='4'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x2'/>
</controller>
<controller type='virtio-serial' index='0'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
</controller>
<memballoon model='virtio'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
</memballoon>
<!-- 1 -->
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x1'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x2'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x2'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x04' slot='0x00' function='0x3'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x3'/>
</hostdev>
<!-- 2 -->
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x05' slot='0x00' function='0x1'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x05' slot='0x00' function='0x2'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x05' slot='0x00' function='0x3'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x3'/>
</hostdev>
<interface type='bridge'>
<mac address='52:54:00:ee:1d:a0'/>
<source bridge='br0'/>
<model type='virtio'/>
</interface>
<serial type='pty'>
<target port='0'/>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
<input type='tablet' bus='usb'/>
<input type='mouse' bus='ps2'/>
<graphics type='vnc' port='5913' autoport='no'/>
<video>
<model type='cirrus' vram='9216' heads='1'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</video>
</devices>
</domain>
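One thing the example above does not do is pin the guest's memory. Since the vCPUs and both GPUs sit on node 0, memory can be restricted to the same node as well; a sketch using virsh (the equivalent numatune element can also be written into the XML):
# Keep the guest's memory allocations on NUMA node 0
virsh numatune win10-gpu-01 --mode strict --nodeset 0 --config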
References
Nvidia official GPU passthrough documentation: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#using-gpu-pass-through-red-hat-el-kvm
KVM domain XML format reference: https://libvirt.org/formatdomain.html
More details on IOMMU and VFIO: https://zhuanlan.zhihu.com/p/27026590




