GPU | Earl C. Ruby III

When adding a GPU to a vSphere VM using PCI passthrough there are a couple of additional settings that you need to make or your VM won’t boot.

When creating the VM you’ll need to set the Actions > Edit > VM Options > Boot Options > Firmware and select “EFI”. You need to do this before you install the operating system on the VM. If you don’t do this the GPUs won’t work and the VM won’t boot.

To add a GPU, in vCenter go to the VM, select Actions > Edit > Add New Device. Any GPUs set up as PCI passthrough devices should appear in a pick list. Add one or more GPUs to your VM.

Note that after adding one device, when you add additional GPUs the first GPU you selected still appears in the pick list. If you add the same GPU more than once your VM will not boot. If you add a GPU that’s being used by another running VM your VM will not boot. Pay attention to the PCI bus addresses displayed and make sure that the GPUs you pick are unique and not in use on another VM.

Finally you have to set up memory-mapped I/O (MMIO) to map system memory to the GPU’s framebuffer memory so that the CPU can pass data to the GPU. In vCenter go to the VM, select Actions > Edit > VM Options > Advanced > Edit configuration.

Once you’re on the Configuration parameters screen, add two more parameters:

pciPassthru.use64bitMMIO = TRUE
pciPassthru.64bitMMIOSizeGB = ????

Actions > Edit > VM Options > Advanced > Edit configuration

The 64bitMMIOSizeGB value is calculated by adding up the total GB of framebuffer memory on all GPUs attached to the VM. If the total GPU framebuffer memory falls on a power-of-2, setting pciPassthru.64bitMMIOSizeGB to the next power of 2 works.

If the total GPU framebuffer memory falls between two powers-of-2, round up to the next power of 2, then round up again, to get a working setting.

Powers of 2 are 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 …

For example, two NVIDIA A100 cards with 40GB each = 80GB (in between 64GB and 128GB), so round up to the next power of 2 (128GB), then round up again to the next power of 2 after that (256GB) to get the correct setting. If you set it too low the VM won’t boot, but it won’t give you an error message telling you what the issue is either.

Here are some configurations that I’ve tested and verified:

2 x 16GB NVIDIA V100 = 32GB, 32 is a power of 2, so round up to the next power of 2 which is 64, set pciPassthru.64bitMMIOSizeGB = 64 to boot.
2 x 24GB NVIDIA P40 = 48GB, which is in-between 32 and 64, round up to 64 and again to 128, requires pciPassthru.64bitMMIOSizeGB = 128 to boot.
8 x 16GB NVIDIA V100 = 128GB, 128 is a power of 2, so round up to the next power of 2 which is 256, set pciPassthru.64bitMMIOSizeGB = 256 to boot.
10 x 16GB NVIDIA V100 = 160GB, which is in-between 128 and 256, round up to 256 and again to 512, set pciPassthru.64bitMMIOSizeGB = 512 to boot.

Hope you find this useful.

NVIDIA publishes a set of NVIDIA GPU-accelerated Containers (NGC) with applications and frameworks for machine learning, deep learning, and high-performance computing.

VMware developed a platform that allows people and companies to create their own private clouds. For customers with high-speed, low-latency networking requirements they offer a couple of different networking options, one of which is PVRDMA (ParaVirtualized Remote Direct Memory Access) networking.

Full disclosure: I used to work for a startup called Bitfusion, and that startup was bought by VMware, so I now work for VMware. At Bitfusion we developed a technology for accessing hardware accelerators, such as NVIDIA GPUs, remotely across networks using TCP/IP, Infiniband, and PVRDMA. I still work on the Bitfusion product at VMware, and spend a lot of my time getting AI and ML workloads to work across networks on virtualized GPUs.

OpenFabrics Enterprise Distribution (OFED) is open-source software for RDMA applications which includes a set of drivers for high-speed network cards to enable RDMA/Infiniband networking. Some NVIDIA NGC containers ship with Mellanox OFED (MOFED) installed. NVIDIA bought Mellanox in 2020, and MOFED is NVIDIA’s distribution of OFED with all of the non-Mellanox drivers removed. OFED includes support for PVRDMA, but MOFED does not.

NVIDIA containers are based on Ubuntu base images. Ubuntu ships its own RDMA drivers in a package called rdma-core. The Ubuntu rdma-core package contains the open source drivers and utilities needed to work with VMware PVRDMA networking.

The Ubuntu rdma-core package contains the open source drivers and utilities needed to work with VMware PVRDMA networking.

Ideally you should only install the RDMA network package that you need, either MOFED or OFED or rdma-core, but not more than one of them. In fact, if you try installing more than one you will have problems. Therefore, if you’re going to use NGC containers on a PVRDMA network you should first remove the MOFED packages and then add the rdma-core packages.

Luckily you can start an NGC container and see if MOFED is installed or not and see what version is installed. If I start the NGC container for Tensor RT:

docker run -it --rm -u root nvcr.io/nvidia/tensorrt:19.09-py3

I can see that it’s based on Ubuntu 18.04 “bionic”:

root@2e70d41e1187:/workspace# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

If I look inside /opt/mellanox/DEBS/ I can see if any MOFED .deb files are installed:

root@2e70d41e1187:/workspace# ls -al /opt/mellanox/DEBS/
total 64
drwxrwxr-x 15 root root 4096 Aug 27  2019 .
drwxr-xr-x  3 root root 4096 Sep 13  2019 ..
drwxrwxr-x  2 root root 4096 Aug 27  2019 3.4-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 3.4-2.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.0-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.0-2.0.0
lrwxrwxrwx  1 root root    9 Aug 27  2019 4.0-2.0.2 -> 4.0-2.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.1-1.0.2
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.2-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.2-1.2.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.3-1.0.1
lrwxrwxrwx  1 root root    9 Aug 27  2019 4.3-3.0.2 -> 4.3-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.4-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.4-2.0.7
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.5-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.6-1.0.1
lrwxrwxrwx  1 root root    9 Aug 27  2019 5.0-0 -> 5.0-1.1.8
drwxrwxr-x  2 root root 4096 Aug 27  2019 5.0-1.1.8
-rwxrwxr-x  1 root root  546 Aug 27  2019 add_mofed_version.sh

In this case there are Mellanox MOFED packages installed. If I look inside these directories (ls -1 /opt/mellanox/DEBS/*) I can see that the packages installed from MOFED are:

ibverbs-utils
libibverbs-dev
libibverbs1
libmlx5-1

These are MOFED versions of packages installed in this specific container. A different NGC container might contain these MOFED packages, or different MOFED packages, or no MOFED packages at all.

There are versions of these same packages in Ubuntu repos, and the Ubuntu versions conflict with the MOFED versions. To use the Ubuntu versions, first remove the MOFED packages:

root@2e70d41e1187:/workspace# apt-get purge -y ibverbs-utils libibverbs-dev libibverbs1 libmlx5-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
  ibverbs-utils* libibverbs-dev* libibverbs1* libmlx5-1*
0 upgraded, 0 newly installed, 4 to remove and 23 not upgraded.
After this operation, 1523 kB disk space will be freed.
(Reading database ... 18622 files and directories currently installed.)
Removing ibverbs-utils (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libibverbs-dev (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...
Removing libibverbs1 (41mlnx1-OFED.4.4.1.0.0.44100) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
(Reading database ... 18449 files and directories currently installed.)
Purging configuration files for libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...

You can see in the output above that the packages that I removed have the name “OFED” in them, indicating that they came from MOFED/OFED, not Ubuntu. If I reinstall using rdma-core and the other packages I need:

apt-get update && apt-get install -y --reinstall \
    -t bionic rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3

This installs everything from the Ubuntu repositories for the “bionic” version, which is the version of Ubuntu that this NGC container is based on. (Which we determined back in step 1.)

The -t flag is necessary because I’ve found that some NGC containers mix code from the repositories of different versions of Ubuntu, and we only want to install packages from the base Ubuntu version, which is “bionic” in this particular case.

At this point the container is ready to use PVRDMA connections.

However, I also want to connect to a remote Bitfusion server across a PVRDMA network and use a pool of GPUs for my TensorRT work, so I also install the Bitfusion client:

wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

To create a new container with all of these changes I just have to whip up a small Dockerfile:

# Base this container on the NGC container you want to use
FROM nvcr.io/nvidia/tensorrt:19.09-py3

# Remove the MOFED packages that are installed,
# determined by running “ls -1 /opt/mellanox/DEBS/*”
RUN apt-get purge -y ibverbs-utils libibverbs-dev \
    libibverbs1 libmlx5-1

# Install the Ubuntu RDMA packages using the
# UBUNTU_CODENAME from /etc/os-release
# as the -t argument.
RUN apt-get update && apt-get install -y --reinstall \
    -t bionic \
    rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3

# Install the Bitfusion 3.0.0 client software for Ubuntu 18.04
RUN wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

RUN apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

To build an image using this Dockerfile:

mkdir -p ~/build
docker build -t tensorrt:19.09-py3-pvrdma -f Dockerfile ~/build

Run this image:

docker run -it --rm -u root --network host \
    tensorrt:19.09-py3-pvrdma

In this instance I’m passing the host’s network through to the container. Assuming that the host already has PVRDMA networking set up correctly, I can use that PVRDMA network inside the NGC container. With the Bitfusion client in the container I can run TensorRT and access GPUs from a remote pool of GPUs across a PVRDMA network.

Hope you find this useful.

You may also be interested in my article Setting up a 100GbE PVRDMA Network on vCenter 7.

Earl C. Ruby III

Tag Archives: GPU

Calculating the value for 64bitMMIOSizeGB

Getting NVIDIA NGC containers to work with VMware PVRDMA networks