This is a talk Keith Bradley and I gave at VMware Explore 2024 Las Vegas called AI Without GPUs: Using Your Existing CPU Resources to Run AI Workloads. Keith Bradley is the Vice President of IT and Security at Nature Fresh Farms. Nature Fresh Farms uses AI to control every aspect of their agricultural operations and they’re using CPUs to process those AI workloads.
Graphics processing units (GPUs) are expensive, hard to acquire and extremely powerful, but there are many AI/ML applications that can run just fine without GPUs. This session covers how to use your existing central processing unit (CPU) resources to run AI workloads, what you can do, what you shouldn’t do and what types of problems you can solve without using any GPUs at all.
I was just interviewed by Frank Denneman for the Unexplored Territory podcast, where we talked about new computing hardware for running AI workloads on vSphere and VCF.
I was invited to AI Field Day 4 in Santa Clara last week to present a couple of talks on running AI workloads on Intel AMX CPUs. This is a recording of the talk I did on setting up Tanzu Kubernetes for running workloads that use Intel AMX CPUs.
I was invited to AI Field Day 4 in Santa Clara last week to present a couple of talks on running AI workloads on Intel AMX CPUs. This is a recording of the talk I did on running LLMs.
NVIDIA publishes a set of NVIDIA GPU-accelerated Containers (NGC) with applications and frameworks for machine learning, deep learning, and high-performance computing.
VMware developed a platform that allows people and companies to create their own private clouds. For customers with high-speed, low-latency networking requirements they offer a couple of different networking options, one of which is PVRDMA (ParaVirtualized Remote Direct Memory Access) networking.
Full disclosure: I used to work for a startup called Bitfusion, and that startup was bought by VMware, so I now work for VMware. At Bitfusion we developed a technology for accessing hardware accelerators, such as NVIDIA GPUs, remotely across networks using TCP/IP, Infiniband, and PVRDMA. I still work on the Bitfusion product at VMware, and spend a lot of my time getting AI and ML workloads to work across networks on virtualized GPUs.
OpenFabrics Enterprise Distribution (OFED) is open-source software for RDMA applications which includes a set of drivers for high-speed network cards to enable RDMA/Infiniband networking. Some NVIDIA NGC containers ship with Mellanox OFED (MOFED) installed. NVIDIA bought Mellanox in 2020, and MOFED is NVIDIA’s distribution of OFED with all of the non-Mellanox drivers removed. OFED includes support for PVRDMA, but MOFED does not.
NVIDIA containers are based on Ubuntu base images. Ubuntu ships its own RDMA drivers in a package called rdma-core. The Ubuntu rdma-core package contains the open source drivers and utilities needed to work with VMware PVRDMA networking.
Ideally you should install only the one RDMA network package that you need: MOFED, OFED, or rdma-core. The Ubuntu packages conflict with the MOFED versions of the same packages, so installing more than one causes problems. Therefore, if you’re going to use NGC containers on a PVRDMA network, you should first remove the MOFED packages and then add the rdma-core packages.
Luckily, you can start an NGC container and check whether MOFED is installed and, if so, which version. If I start the NGC container for TensorRT:
docker run -it --rm -u root nvcr.io/nvidia/tensorrt:19.09-py3
Checking /etc/os-release inside the container, I can see that it’s based on Ubuntu 18.04 “bionic”.
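One way to read the base release from a script is to pull UBUNTU_CODENAME out of /etc/os-release, the same field the Dockerfile comments later in this post refer to. This is only a minimal sketch: the function name ubuntu_codename and its optional file argument are illustrative, not part of any tool.

```shell
#!/bin/sh
# Print the Ubuntu codename recorded in an os-release file.
# The optional argument (default /etc/os-release) is an illustrative
# convenience so the function can be tried against a copy of the file.
ubuntu_codename() {
    os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak
    # into the calling shell, then print the one field we need.
    ( . "$os_release" && printf '%s\n' "$UBUNTU_CODENAME" )
}

# Usage inside the container:
#   ubuntu_codename
```

The printed value is what would be passed to apt-get as the -t argument.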
If I look inside /opt/mellanox/DEBS/ I can see whether any MOFED .deb files are present:
root@2e70d41e1187:/workspace# ls -al /opt/mellanox/DEBS/
total 64
drwxrwxr-x 15 root root 4096 Aug 27 2019 .
drwxr-xr-x 3 root root 4096 Sep 13 2019 ..
drwxrwxr-x 2 root root 4096 Aug 27 2019 3.4-1.0.0
drwxrwxr-x 2 root root 4096 Aug 27 2019 3.4-2.0.0
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.0-1.0.1
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.0-2.0.0
lrwxrwxrwx 1 root root 9 Aug 27 2019 4.0-2.0.2 -> 4.0-2.0.0
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.1-1.0.2
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.2-1.0.0
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.2-1.2.0
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.3-1.0.1
lrwxrwxrwx 1 root root 9 Aug 27 2019 4.3-3.0.2 -> 4.3-1.0.1
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.4-1.0.0
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.4-2.0.7
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.5-1.0.1
drwxrwxr-x 2 root root 4096 Aug 27 2019 4.6-1.0.1
lrwxrwxrwx 1 root root 9 Aug 27 2019 5.0-0 -> 5.0-1.1.8
drwxrwxr-x 2 root root 4096 Aug 27 2019 5.0-1.1.8
-rwxrwxr-x 1 root root 546 Aug 27 2019 add_mofed_version.sh
In this case there are MOFED packages installed. If I look inside these directories (ls -1 /opt/mellanox/DEBS/*) I can see that the packages installed from MOFED are:
ibverbs-utils
libibverbs-dev
libibverbs1
libmlx5-1
These are the MOFED versions of the packages installed in this specific container. A different NGC container might contain the same MOFED packages, different ones, or none at all.
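Because the package list can differ from one NGC container to the next, it may help to derive it rather than read it off by eye. The sketch below assumes the files under /opt/mellanox/DEBS follow the usual Debian naming convention of package_version_arch.deb; the function name list_mofed_packages is hypothetical.

```shell
#!/bin/sh
# List the unique package names found under a MOFED .deb directory.
# Assumes file names of the form <package>_<version>_<arch>.deb.
# The default path is the directory inspected above.
list_mofed_packages() {
    deb_dir="${1:-/opt/mellanox/DEBS}"
    # Strip the directory part, then everything from the first
    # underscore onward, leaving only the package name.
    find "$deb_dir" -name '*.deb' 2>/dev/null |
        sed -e 's|.*/||' -e 's|_.*||' |
        sort -u
}

# Usage inside the container:
#   list_mofed_packages
```

The resulting names are the candidates to pass to apt-get purge.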
There are versions of these same packages in Ubuntu repos, and the Ubuntu versions conflict with the MOFED versions. To use the Ubuntu versions, first remove the MOFED packages:
root@2e70d41e1187:/workspace# apt-get purge -y ibverbs-utils libibverbs-dev libibverbs1 libmlx5-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
ibverbs-utils* libibverbs-dev* libibverbs1* libmlx5-1*
0 upgraded, 0 newly installed, 4 to remove and 23 not upgraded.
After this operation, 1523 kB disk space will be freed.
(Reading database ... 18622 files and directories currently installed.)
Removing ibverbs-utils (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libibverbs-dev (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...
Removing libibverbs1 (41mlnx1-OFED.4.4.1.0.0.44100) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
(Reading database ... 18449 files and directories currently installed.)
Purging configuration files for libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...
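You can see in the output above that the packages I removed have the string “OFED” in their version numbers, indicating that they came from MOFED/OFED, not Ubuntu. Next I reinstall rdma-core and the other packages I need from the Ubuntu repositories:
root@2e70d41e1187:/workspace# apt-get update && apt-get install -y --reinstall -t bionic rdma-core libibverbs1 ibverbs-providers infiniband-diags ibverbs-utils libcapstone3
This installs everything from the Ubuntu repositories for the “bionic” release, which is the version of Ubuntu that this NGC container is based on, as determined earlier when the container was first started.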
The -t (--target-release) flag is necessary because I’ve found that some NGC containers mix packages from the repositories of different versions of Ubuntu, and we only want to install packages from the base Ubuntu version, which is “bionic” in this particular case.
At this point the container is ready to use PVRDMA connections.
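A quick sanity check at this stage is to confirm that no Mellanox builds are still installed. The sketch below filters dpkg-query output for the “mlnx” marker seen in the version strings of the purged packages; treating that marker as reliable for every MOFED release is an assumption, and the function name mofed_builds is hypothetical.

```shell
#!/bin/sh
# Read "package version" pairs on stdin and print the names of
# packages whose version string marks them as a Mellanox build.
# The "mlnx" marker is taken from the version strings in the purge
# output above (for example 41mlnx1-OFED.4.4.1.0.0.44100).
mofed_builds() {
    awk 'tolower($2) ~ /mlnx/ { print $1 }'
}

# Usage inside the container (dpkg-query ships with dpkg):
#   dpkg-query -W -f='${Package} ${Version}\n' | mofed_builds
# No output suggests that no MOFED builds are left installed.
```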
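However, I also want to connect to a remote Bitfusion server across a PVRDMA network and use a pool of GPUs for my TensorRT work, so I also install the Bitfusion client:
root@2e70d41e1187:/workspace# wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb
root@2e70d41e1187:/workspace# apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb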
To create a new container with all of these changes I just have to whip up a small Dockerfile:
# Base this container on the NGC container you want to use
FROM nvcr.io/nvidia/tensorrt:19.09-py3
# Remove the MOFED packages that are installed,
# determined by running "ls -1 /opt/mellanox/DEBS/*"
RUN apt-get purge -y ibverbs-utils libibverbs-dev \
    libibverbs1 libmlx5-1
# Install the Ubuntu RDMA packages using the
# UBUNTU_CODENAME from /etc/os-release
# as the -t argument.
RUN apt-get update && apt-get install -y --reinstall \
    -t bionic \
    rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3
# Install the Bitfusion 3.0.0 client software for Ubuntu 18.04
RUN wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb
RUN apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb
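The Dockerfile above still has to be built and tagged before it can be run. A minimal sketch follows, assuming the Dockerfile is saved in the current directory and a working docker CLI is available. The helper names pvrdma_tag and build_pvrdma_image are hypothetical, and the base image argument is used only to derive the tag, since the FROM line in the Dockerfile is fixed.

```shell
#!/bin/sh
# Derive a local tag from an NGC image reference by dropping the
# registry path and appending "-pvrdma", e.g.
# nvcr.io/nvidia/tensorrt:19.09-py3 -> tensorrt:19.09-py3-pvrdma
pvrdma_tag() {
    printf '%s-pvrdma\n' "${1##*/}"
}

# Build the Dockerfile in the given directory (default: current one)
# and tag the result with the derived name.
build_pvrdma_image() {
    base_image="$1"
    context_dir="${2:-.}"
    docker build -t "$(pvrdma_tag "$base_image")" "$context_dir"
}

# Usage:
#   build_pvrdma_image nvcr.io/nvidia/tensorrt:19.09-py3
```

Deriving the tag from the base image name keeps the modified image easy to match to the NGC release it came from.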
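After building an image from this Dockerfile and tagging it tensorrt:19.09-py3-pvrdma, I can start a container from it:
docker run -it --rm -u root --network host \
    tensorrt:19.09-py3-pvrdma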
In this instance I’m passing the host’s network through to the container. Assuming that the host already has PVRDMA networking set up correctly, I can use that PVRDMA network inside the NGC container. With the Bitfusion client in the container I can run TensorRT and access GPUs from a remote pool of GPUs across a PVRDMA network.
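Since all of this depends on the host already having a working PVRDMA setup, it can be worth checking on the host that the kernel has registered an RDMA device before starting the container. The sketch below reads /sys/class/infiniband, where Linux exposes RDMA devices. The function name rdma_devices and its optional directory argument are hypothetical, and it only shows that a device exists, not that the PVRDMA network is configured correctly.

```shell
#!/bin/sh
# List RDMA devices registered with the kernel by reading sysfs.
# Returns non-zero when the directory is missing or empty. The
# optional argument exists only so the function can be exercised
# against a different directory.
rdma_devices() {
    sysfs_dir="${1:-/sys/class/infiniband}"
    devices=$(ls -1 "$sysfs_dir" 2>/dev/null)
    [ -n "$devices" ] || return 1
    printf '%s\n' "$devices"
}

# Usage on the host, before starting the container:
#   rdma_devices || echo "no RDMA devices found" >&2
```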