I was just interviewed by Frank Denneman for the Unexplored Territory podcast, where we talked about new computing hardware for running AI workloads on vSphere and VCF.
Hope you find this useful.
When adding a GPU to a vSphere VM using PCI passthrough there are a couple of additional settings that you need to make or your VM won’t boot.
When creating the VM you'll need to go to Actions > Edit > VM Options > Boot Options > Firmware and select "EFI". You need to do this before you install the operating system on the VM. If you don't do this the GPUs won't work and the VM won't boot.
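For reference, the firmware choice ends up as a single entry in the VM's .vmx file, so if you manage VM configs outside the vSphere Client this is the key to look for (standard VMX setting, shown here for illustration):
firmware = "efi"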
To add a GPU, in vCenter go to the VM, select Actions > Edit > Add New Device. Any GPUs set up as PCI passthrough devices should appear in a pick list. Add one or more GPUs to your VM.
Note that after adding one device, when you add additional GPUs the first GPU you selected still appears in the pick list. If you add the same GPU more than once your VM will not boot. If you add a GPU that’s being used by another running VM your VM will not boot. Pay attention to the PCI bus addresses displayed and make sure that the GPUs you pick are unique and not in use on another VM.
Finally you have to set up memory-mapped I/O (MMIO) to map system memory to the GPU’s framebuffer memory so that the CPU can pass data to the GPU. In vCenter go to the VM, select Actions > Edit > VM Options > Advanced > Edit configuration.
Once you’re on the Configuration parameters screen, add two more parameters:
pciPassthru.use64bitMMIO = TRUE
pciPassthru.64bitMMIOSizeGB = ????
The 64bitMMIOSizeGB value is calculated by adding up the total GB of framebuffer memory on all GPUs attached to the VM. If the total GPU framebuffer memory falls on a power-of-2, setting pciPassthru.64bitMMIOSizeGB to the next power of 2 works.
If the total GPU framebuffer memory falls between two powers-of-2, round up to the next power of 2, then round up again, to get a working setting.
Powers of 2 are 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 …
For example, two NVIDIA A100 cards with 40GB each = 80GB (in between 64GB and 128GB), so round up to the next power of 2 (128GB), then round up again to the next power of 2 after that (256GB) to get the correct setting. If you set it too low the VM won’t boot, but it won’t give you an error message telling you what the issue is either.
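If you'd rather script the calculation than do it by hand, here's a minimal sketch that applies the rule above (the script name and usage are my own; pass it the framebuffer size in GB of each GPU attached to the VM):
#!/bin/bash
# mmio-size.sh -- sketch of the 64bitMMIOSizeGB calculation described above
# Usage: ./mmio-size.sh 40 40   (two 40GB GPUs)
total=0
for fb in "$@"; do
  total=$((total + fb))
done
# Find the smallest power of 2 that is >= the total framebuffer memory...
size=2
while [ "$size" -lt "$total" ]; do
  size=$((size * 2))
done
# ...then round up one more time, per the rule above.
size=$((size * 2))
echo "pciPassthru.64bitMMIOSizeGB = $size"
Running it with 40 40 (the two A100s from the example) prints pciPassthru.64bitMMIOSizeGB = 256, matching the value above.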
Here are some configurations that I’ve tested and verified:
Hope you find this useful.
After writing my last article on Getting NVIDIA NGC containers to work with VMware PVRDMA networks I had a couple of people ask me “How do I set up PVRDMA networking on vCenter?” These are the steps that I took to set up PVRDMA networking in my lab.
RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. It works by encapsulating an Infiniband (IB) transport packet and sending it over Ethernet. If you’re working with network applications that require high bandwidth and low latency, RDMA will give you lower latency, higher bandwidth, and a lower CPU load than an API such as Berkeley sockets.
Full disclosure: I used to work for a startup called Bitfusion, and that startup was bought by VMware, so I now work for VMware. At Bitfusion we developed a technology for accessing hardware accelerators, such as NVIDIA GPUs, remotely across networks using TCP/IP, Infiniband, and PVRDMA. I still work on the Bitfusion product at VMware, and spend a lot of my time getting AI and ML workloads to work across networks on virtualized GPUs.
In my lab I'm using Mellanox ConnectX-5 and ConnectX-6 cards on hosts that are running ESXi 7.0.2 and vCenter 7.0.2. The cards are connected to a Mellanox Onyx MSN2700 100GbE switch.
Since I’m working with Ubuntu 18.04 and 20.04 virtual machines (VMs) in a vCenter environment, I have a couple of options for high-speed networking:
First, make sure that your switch is set up correctly. On my Mellanox Onyx MSN2700 100GbE switch that means:
vCenter supports Paravirtual RDMA (PVRDMA) networking using Distributed Virtual Switches (DVS). This means you’re setting up a virtual switch in vCenter and you’ll connect your VMs to this virtual switch.
In vCenter navigate to Hosts and Clusters, then click the DataCenter icon (looks like a sphere or globe with a line under it). Find the cluster you want to add the virtual switch to, right click on the cluster and select Distributed Switch > New Distributed Switch.
On each ESXi host, go to Configure > System > Advanced System Settings and set:
Net.PVRDMAVmknic = "vmk0"
Repeat for each ESXi host.
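If you'd rather script this than click through every host, the equivalent esxcli command should be (an untested sketch; vmk0 is just the vmknic I'm using, substitute yours):
$ esxcli system settings advanced set -o /Net/PVRDMAVmknic -s vmk0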
Then go to each host's firewall settings (Configure > System > Firewall), find the rule named pvrdma, and check the box to allow PVRDMA traffic through the firewall. Repeat for each ESXi host.
To enable jumbo frames in a vCenter cluster using virtual switches you have to set MTU 9000 on the Distributed Virtual Switch.
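A quick way to confirm that jumbo frames actually make it end to end is a ping that isn't allowed to fragment. Assuming the guest interfaces are also set to MTU 9000, an 8972-byte payload (9000 bytes minus 28 bytes of IP and ICMP headers) should succeed and anything larger should fail:
$ ping -M do -s 8972 192.168.128.39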
For Ubuntu:
sudo apt-get install rdma-core infiniband-diags ibverbs-utils
In order for RDMA to work the vmw_pvrdma module has to be loaded after several other modules. Maybe someone else knows a better way to do this, but the method that I got to work was adding a script /usr/local/sbin/rdma-modules.sh to ensure that the Infiniband modules are loaded on boot, then calling that from /etc/rc.local so it gets executed at boot time.
#!/bin/bash
# rdma-modules.sh
# modules that need to be loaded for PVRDMA to work
/sbin/modprobe mlx4_ib
/sbin/modprobe ib_umad
/sbin/modprobe rdma_cm
/sbin/modprobe rdma_ucm
# Once those are loaded, reload the vmw_pvrdma module
/sbin/modprobe -r vmw_pvrdma
/sbin/modprobe vmw_pvrdma
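To wire that into boot, call the script from /etc/rc.local. A minimal sketch of rc.local (yours may already contain other commands):
#!/bin/bash
# /etc/rc.local -- load the RDMA modules for PVRDMA at boot
/usr/local/sbin/rdma-modules.sh
exit 0
Make both files executable, since on Ubuntu 18.04 and 20.04 systemd only runs /etc/rc.local if it exists and is executable:
$ sudo chmod +x /usr/local/sbin/rdma-modules.sh /etc/rc.local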
Once that’s done just set up the PVRDMA network interface the same as any other network interface.
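Once the interface is up, the ibverbs tools installed earlier can confirm that the guest actually sees a PVRDMA device (the device usually shows up with a vmw_pvrdma name; exact output will vary by setup):
$ ibv_devices
$ ibv_devinfo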
To verify that I'm getting something close to 100Gbps on the network I use the perftest package.
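The perftest tools aren't part of the packages installed above; on Ubuntu they come from the standard repositories:
$ sudo apt-get install perftest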
To test bandwidth I pick two VMs on different hosts. On one VM I run:
$ ib_send_bw --report_gbits
On the other VM I run the same command plus I add the IP address of the PVRDMA interface on the first machine:
$ ib_send_bw --report_gbits 192.168.128.39
That sends a bunch of data across the network and reports back:
So I’m getting an average of 96.31Gbps over the network connection.
I can also check the latency using the ib_send_lat command.
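It runs the same way as the bandwidth test: start the listener on one VM, then run it again on the other VM with the first VM's PVRDMA address (reusing the example address from above):
$ ib_send_lat
$ ib_send_lat 192.168.128.39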
Hope you find this useful.