When creating the VM, go to Actions > Edit > VM Options > Boot Options > Firmware and select “EFI”. You need to do this before you install the operating system on the VM. If you don’t, the GPUs won’t work and the VM won’t boot.
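If you want to double-check the firmware setting outside of the UI, it shows up as a single line in the VM’s .vmx file (in the VM’s directory on the datastore); a minimal sketch of what to look for:

firmware = "efi"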
To add a GPU, in vCenter go to the VM, select Actions > Edit > Add New Device. Any GPUs set up as PCI passthrough devices should appear in a pick list. Add one or more GPUs to your VM.
Note that after adding one device, when you add additional GPUs the first GPU you selected still appears in the pick list. If you add the same GPU more than once your VM will not boot. If you add a GPU that’s being used by another running VM your VM will not boot. Pay attention to the PCI bus addresses displayed and make sure that the GPUs you pick are unique and not in use on another VM.
Finally you have to set up memory-mapped I/O (MMIO) to map system memory to the GPU’s framebuffer memory so that the CPU can pass data to the GPU. In vCenter go to the VM, select Actions > Edit > VM Options > Advanced > Edit configuration.
Once you’re on the Configuration parameters screen, add two more parameters:
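Assuming the standard VMware passthrough parameters, the two entries look like this (the size value here is just a placeholder; how to calculate it is explained below):

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"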
The 64bitMMIOSizeGB value is calculated by adding up the total GB of framebuffer memory on all GPUs attached to the VM. If the total GPU framebuffer memory falls on a power-of-2, setting pciPassthru.64bitMMIOSizeGB to the next power of 2 works.
If the total GPU framebuffer memory falls between two powers-of-2, round up to the next power of 2, then round up again, to get a working setting.
Powers of 2 are 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 …
For example, two NVIDIA A100 cards with 40GB each = 80GB (in between 64GB and 128GB), so round up to the next power of 2 (128GB), then round up again to the next power of 2 after that (256GB) to get the correct setting. If you set it too low the VM won’t boot, but it won’t give you an error message telling you what the issue is either.
Here are some configurations that I’ve tested and verified:
2 x 16GB NVIDIA V100 = 32GB, 32 is a power of 2, so round up to the next power of 2 which is 64, set pciPassthru.64bitMMIOSizeGB = 64 to boot.
2 x 24GB NVIDIA P40 = 48GB, which is in-between 32 and 64, round up to 64 and again to 128, requires pciPassthru.64bitMMIOSizeGB = 128 to boot.
8 x 16GB NVIDIA V100 = 128GB, 128 is a power of 2, so round up to the next power of 2 which is 256, set pciPassthru.64bitMMIOSizeGB = 256 to boot.
10 x 16GB NVIDIA V100 = 160GB, which is in-between 128 and 256, round up to 256 and again to 512, set pciPassthru.64bitMMIOSizeGB = 512 to boot.
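Both rules collapse into a single calculation: take the smallest power of 2 that is greater than or equal to the total framebuffer, then double it. If you’d rather not count on your fingers, here’s a quick bash sketch of that rule (the 80 is just the A100 example from above):

total=80   # total GB of framebuffer across all GPUs attached to the VM
p=2; while [ "$p" -lt "$total" ]; do p=$((p * 2)); done   # smallest power of 2 >= total
echo "pciPassthru.64bitMMIOSizeGB = $((p * 2))"

For 80GB this prints 256, which matches the A100 example above.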
RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. It works by encapsulating an Infiniband (IB) transport packet and sending it over Ethernet. If you’re working with network applications that require high bandwidth and low latency, RDMA will give you lower latency, higher bandwidth, and a lower CPU load than an API such as Berkeley sockets.
Full disclosure: I used to work for a startup called Bitfusion, and that startup was bought by VMware, so I now work for VMware. At Bitfusion we developed a technology for accessing hardware accelerators, such as NVIDIA GPUs, remotely across networks using TCP/IP, Infiniband, and PVRDMA. I still work on the Bitfusion product at VMware, and spend a lot of my time getting AI and ML workloads to work across networks on virtualized GPUs.
In my lab I’m using Mellanox ConnectX-5 and ConnectX-6 cards on hosts that are running ESXi 7.0.2 and vCenter 7.0.2. The cards are connected to a Mellanox Onyx MSN2700 100GbE switch.
Since I’m working with Ubuntu 18.04 and 20.04 virtual machines (VMs) in a vCenter environment, I have a few options for high-speed networking:
I can use PCI passthrough to pass the PCI network card directly through to the VM and use the network card’s native drivers on the VM to set up a networking stack. However this means that my network card is only available to a single VM on the host, and can’t be shared between VMs. It also breaks vMotion (the ability to live-migrate the VM to another host) since the VM is tied to a specific piece of hardware on a specific host. I’ve set this up in my lab but stopped doing this because of the lack of flexibility and because we couldn’t identify any performance difference compared to SR-IOV networking.
I can use SR-IOV and virtual functions (VFs) to make the single card appear as if it’s multiple network cards with multiple PCI addresses, pass those through to the VM, and use the network card’s native drivers on the VM to set up a networking stack. I’ve set this up in my lab as well. I can share a single card between multiple VMs and the performance is similar to PCI passthrough. The disadvantage is that setting up SR-IOV and configuring the VFs is specific to a card’s model and manufacturer, so what works in my lab might not work in someone else’s environment.
I can set up PVRDMA networking and use the PVRDMA driver that comes with Ubuntu. This is what I’m going to show how to do in this article.
Set up your physical switch
First, make sure that your switch is set up correctly. On my Mellanox Onyx MSN2700 100GbE switch that means:
Enable the ports you’re connecting to.
Set the speed of each port to 100G.
Set auto-negotiation for each link.
Set the flow control mode to Global.
Set LAG mode to On.
Set up your virtual switch
vCenter supports Paravirtual RDMA (PVRDMA) networking using Distributed Virtual Switches (DVS). This means you’re setting up a virtual switch in vCenter and you’ll connect your VMs to this virtual switch.
In vCenter navigate to Hosts and Clusters, then click the DataCenter icon (looks like a sphere or globe with a line under it). Find the cluster you want to add the virtual switch to, right click on the cluster and select Distributed Switch > New Distributed Switch. Give the new switch a name (“rdma-dvs” in this article) and use these settings:
Version: 7.0.2 – ESXi 7.0.2 and later
Number of uplinks: 4
Network I/O control: Disabled
Default port group: Create
Port Group Name: “VM 100GbE Network”
Figure out which NIC is the right NIC
Go to Hosts and Clusters
Select the host
Click the Configure tab, then Networking > Physical adapters
Note which NIC is the 100GbE NIC for each host
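If you’d rather do this from the ESXi shell, this command lists each host’s physical adapters along with link status and speed, which makes the 100GbE vmnic easy to spot:

$ esxcli network nic list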
Add Hosts to the Distributed Virtual Switch
Go to Hosts and Clusters
Click the DataCenter icon
Select the Networks top tab and the Distributed Switches sub-tab
Right click “rdma-dvs”
Click “Add and Manage Hosts”
Select “Add Hosts”
Select the hosts. Use “auto” for uplinks.
Select the physical adapters based on the list you created in the previous step, or find the Mellanox card in the list and add it. If more than one is listed, look for the card that’s “connected”.
Manage VMkernel adapters (accept defaults)
Migrate virtual machine networking (none)
Tag a vmknic for PVRDMA
Select an ESXi host and go to the Configure tab
Go to System > Advanced System Settings
Filter on “PVRDMA”
Set Net.PVRDMAVmknic = "vmk0"
Repeat for each ESXi host.
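If you’d rather script this than click through every host, the equivalent esxcli call should be (assuming vmk0 is the vmkernel adapter you want to tag, as above):

$ esxcli system settings advanced set -o /Net/PVRDMAVmknic -s vmk0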
Set up the firewall for PVRDMA
Select an ESXi host and go to the Configure tab
Go to System > Firewall
Scroll down to find pvrdma and check the box to allow PVRDMA traffic through the firewall.
Repeat for each ESXi host.
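This step can also be scripted. Assuming the ruleset shows up as pvrdma on your build, something like this should enable it on each host:

$ esxcli network firewall ruleset set -r pvrdma -e true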
Set up Jumbo Frames for PVRDMA
To enable jumbo frames in a vCenter cluster using virtual switches, you have to set MTU 9000 on the Distributed Virtual Switch.
Click the Data Center icon.
Click the Distributed Virtual Switch that you want to set up, “rdma-dvs” in this example.
Go to the Configure tab.
Select Settings > Properties.
Look at Properties > Advanced > MTU. This should be set to 9000. If it’s not, click Edit.
In order for RDMA to work the vmw_pvrdma module has to be loaded after several other modules. Maybe someone else knows a better way to do this, but the method that I got to work was adding a script /usr/local/sbin/rdma-modules.sh to ensure that Infiniband modules are loaded on boot, then calling that from /etc/rc.local so it gets executed at boot time.
#!/bin/bash
# modules that need to be loaded for PVRDMA to work (adjust the list for your kernel if needed)
/sbin/modprobe -a ib_core ib_umad ib_uverbs rdma_cm rdma_ucm
# Once those are loaded, reload the vmw_pvrdma module
/sbin/modprobe -r vmw_pvrdma
/sbin/modprobe vmw_pvrdma
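A minimal /etc/rc.local that calls the script at boot looks like this (both files need to be executable, since Ubuntu only runs rc.local if it’s executable):

#!/bin/bash
/usr/local/sbin/rdma-modules.sh
exit 0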
Once that’s done just set up the PVRDMA network interface the same as any other network interface.
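On Ubuntu 18.04 and 20.04 that means a netplan file. Here’s a minimal sketch, assuming the PVRDMA adapter shows up as ens192 in the guest (the interface name and address are placeholders for your own values):

# /etc/netplan/60-pvrdma.yaml
network:
  version: 2
  ethernets:
    ens192:
      mtu: 9000
      addresses: [192.168.128.39/24]

Apply it with sudo netplan apply. With MTU 9000 set end-to-end you can verify jumbo frames with ping -M do -s 8972 <remote IP>.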
Testing the network
To verify that I’m getting something close to 100Gbps on the network I use the perftest package.
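On Ubuntu the tools should be installable straight from the standard repositories:

$ sudo apt install perftest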
To test bandwidth I pick two VMs on different hosts. On one VM I run:
$ ib_send_bw --report_gbits
On the other VM I run the same command plus I add the IP address of the PVRDMA interface on the first machine:
$ ib_send_bw --report_gbits 192.168.128.39
That sends a bunch of data across the network and reports back the measured bandwidth.
So I’m getting an average of 96.31Gbps over the network connection.
I can also check the latency using ib_send_lat, which follows the same server/client pattern:
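Run it with no arguments on one VM, then point the second VM at the first one’s PVRDMA address (the same example address as above):

$ ib_send_lat
$ ib_send_lat 192.168.128.39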
Hope you find this useful.