
Getting NVIDIA NGC containers to work with VMware PVRDMA networks

NVIDIA publishes a set of NVIDIA GPU-accelerated Containers (NGC) with applications and frameworks for machine learning, deep learning, and high-performance computing.

VMware developed a platform that allows people and companies to create their own private clouds. For customers with high-speed, low-latency networking requirements, they offer a couple of different networking options, one of which is PVRDMA (ParaVirtualized Remote Direct Memory Access) networking.

Full disclosure: I used to work for a startup called Bitfusion, and that startup was bought by VMware, so I now work for VMware. At Bitfusion we developed a technology for accessing hardware accelerators, such as NVIDIA GPUs, remotely across networks using TCP/IP, InfiniBand, and PVRDMA. I still work on the Bitfusion product at VMware, and spend a lot of my time getting AI and ML workloads to work across networks on virtualized GPUs.

OpenFabrics Enterprise Distribution (OFED) is open-source software for RDMA applications that includes a set of drivers for high-speed network cards, enabling RDMA/InfiniBand networking. Some NVIDIA NGC containers ship with Mellanox OFED (MOFED) installed. NVIDIA bought Mellanox in 2020, and MOFED is NVIDIA’s distribution of OFED with all of the non-Mellanox drivers removed. OFED includes support for PVRDMA, but MOFED does not.

NVIDIA’s NGC containers are built on Ubuntu base images. Ubuntu ships its own RDMA drivers in a package called rdma-core, which contains the open-source drivers and utilities needed to work with VMware PVRDMA networking.

Ideally you should only install the RDMA network package that you need, either MOFED or OFED or rdma-core, but not more than one of them. In fact, if you try installing more than one you will have problems. Therefore, if you’re going to use NGC containers on a PVRDMA network you should first remove the MOFED packages and then add the rdma-core packages.

Luckily you can start an NGC container and check whether MOFED is installed and, if so, which version. If I start the NGC container for TensorRT:

docker run -it --rm -u root nvcr.io/nvidia/tensorrt:19.09-py3

I can see that it’s based on Ubuntu 18.04 “bionic”:

root@2e70d41e1187:/workspace# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

If I look inside /opt/mellanox/DEBS/ I can see if any MOFED .deb files are installed:

root@2e70d41e1187:/workspace# ls -al /opt/mellanox/DEBS/
total 64
drwxrwxr-x 15 root root 4096 Aug 27  2019 .
drwxr-xr-x  3 root root 4096 Sep 13  2019 ..
drwxrwxr-x  2 root root 4096 Aug 27  2019 3.4-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 3.4-2.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.0-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.0-2.0.0
lrwxrwxrwx  1 root root    9 Aug 27  2019 4.0-2.0.2 -> 4.0-2.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.1-1.0.2
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.2-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.2-1.2.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.3-1.0.1
lrwxrwxrwx  1 root root    9 Aug 27  2019 4.3-3.0.2 -> 4.3-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.4-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.4-2.0.7
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.5-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.6-1.0.1
lrwxrwxrwx  1 root root    9 Aug 27  2019 5.0-0 -> 5.0-1.1.8
drwxrwxr-x  2 root root 4096 Aug 27  2019 5.0-1.1.8
-rwxrwxr-x  1 root root  546 Aug 27  2019 add_mofed_version.sh

In this case there are Mellanox MOFED packages installed. If I look inside these directories (ls -1 /opt/mellanox/DEBS/*) I can see that the packages installed from MOFED are:

  • ibverbs-utils
  • libibverbs-dev
  • libibverbs1
  • libmlx5-1

These are MOFED versions of packages installed in this specific container. A different NGC container might contain these MOFED packages, or different MOFED packages, or no MOFED packages at all.
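
You can double-check which of the currently installed packages came from MOFED by querying dpkg; the MOFED-supplied packages carry “mlnx” and “OFED” in their version strings:

# List installed RDMA-related packages; MOFED-supplied packages show
# version strings like "41mlnx1-OFED.4.4.1.0.0.44100".
dpkg -l | grep -Ei 'ibverbs|mlx|rdma'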

There are versions of these same packages in Ubuntu repos, and the Ubuntu versions conflict with the MOFED versions. To use the Ubuntu versions, first remove the MOFED packages:

root@2e70d41e1187:/workspace# apt-get purge -y ibverbs-utils libibverbs-dev libibverbs1 libmlx5-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
  ibverbs-utils* libibverbs-dev* libibverbs1* libmlx5-1*
0 upgraded, 0 newly installed, 4 to remove and 23 not upgraded.
After this operation, 1523 kB disk space will be freed.
(Reading database ... 18622 files and directories currently installed.)
Removing ibverbs-utils (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libibverbs-dev (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...
Removing libibverbs1 (41mlnx1-OFED.4.4.1.0.0.44100) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
(Reading database ... 18449 files and directories currently installed.)
Purging configuration files for libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...

You can see in the output above that the packages I removed have “OFED” in their version strings, indicating that they came from MOFED, not Ubuntu. Next I reinstall using rdma-core and the other packages I need:

apt-get update && apt-get install -y --reinstall \
    -t bionic rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3

This installs everything from the Ubuntu repositories for the “bionic” release, which is the version of Ubuntu that this NGC container is based on (the version we determined earlier from /etc/os-release).

The -t flag is necessary because I’ve found that some NGC containers mix packages from the repositories of different Ubuntu releases, and we only want to install packages from the base Ubuntu release, which is “bionic” in this particular case.
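
If you want to confirm which repository and version a package would come from before installing it, apt-cache policy shows the candidate version and its source:

# Show the candidate version and repository for each package.
apt-cache policy rdma-core libibverbs1 ibverbs-providers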

At this point the container is ready to use PVRDMA connections.
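
As a quick sanity check, the ibverbs-utils package installed above provides ibv_devices and ibv_devinfo; assuming the container is attached to a PVRDMA-capable network adapter, these should list an RDMA device (the exact device name depends on your setup):

# List the RDMA devices visible inside the container, then show their details.
ibv_devices
ibv_devinfo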

However, I also want to connect to a remote Bitfusion server across a PVRDMA network and use a pool of GPUs for my TensorRT work, so I also install the Bitfusion client:

wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

To create a new container with all of these changes I just have to whip up a small Dockerfile:

# Base this container on the NGC container you want to use
FROM nvcr.io/nvidia/tensorrt:19.09-py3

# Remove the MOFED packages that are installed,
# determined by running "ls -1 /opt/mellanox/DEBS/*"
RUN apt-get purge -y ibverbs-utils libibverbs-dev \
    libibverbs1 libmlx5-1

# Install the Ubuntu RDMA packages using the
# UBUNTU_CODENAME from /etc/os-release
# as the -t argument.
RUN apt-get update && apt-get install -y --reinstall \
    -t bionic \
    rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3

# Install the Bitfusion 3.0.0 client software for Ubuntu 18.04
RUN wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

RUN apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

To build an image using this Dockerfile:

mkdir -p ~/build
docker build -t tensorrt:19.09-py3-pvrdma -f Dockerfile ~/build

Run this image:

docker run -it --rm -u root --network host \
    tensorrt:19.09-py3-pvrdma

In this instance I’m passing the host’s network through to the container. Assuming that the host already has PVRDMA networking set up correctly, I can use that PVRDMA network inside the NGC container. With the Bitfusion client in the container I can run TensorRT and access GPUs from a remote pool of GPUs across a PVRDMA network.
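
As a rough sketch of what that looks like (the exact options depend on your Bitfusion version and how many GPUs you want to request), the Bitfusion client wraps whatever command you want to accelerate:

# Request one GPU from the remote Bitfusion pool and run a TensorRT tool
# through it. trtexec ships with the TensorRT container; substitute your
# own workload here.
bitfusion run -n 1 -- trtexec --help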

Hope you find this useful.


Fixing Docker pull timeout errors in Jenkins

Jenkins has a Docker plugin that allows you to authenticate with a docker image registry, pull the container image that you want, and then run tests inside the container. The plugin works well, and you’re guaranteed a “known starting state” for your test since you’re running in a freshly-instantiated container.

The issue that I ran into is that my Docker registry is in a different data center than my test environment, there are tests running 24×7, and occasionally there are temporary Internet outages between the two data centers. When that happens the plugin’s docker.image().pull() command will fail. Even if the image is stored locally it will fail, because it attempts to verify the locally-stored image’s signature against the registry image’s signature.

Since the Jenkinsfile is written in Groovy my solution was to add a Groovy function to the top of the Jenkinsfile that wrapped the docker.image().pull() command inside of a retry loop:

// Since docker.image().pull() will fail during
// intermittent or temporary network outages wrap
// it in a retry loop.
def public static docker_image_pull(Object docker, String imageName) {
    def int attempt = 0
    def int max_attempts = 10
    def boolean success = false
    while ((! success) && (attempt < max_attempts)) {
        try {
            docker.image(imageName).pull()
            success = true
        } catch (Exception err) {
            attempt += 1
            def sleep_sec = attempt * 30
            println("Attempt #${attempt}: Failed to pull ${imageName}. Sleeping ${sleep_sec}s")
            sleep(sleep_sec)
        }
    }
    if (! success) {
        throw new Exception("Failed to pull ${imageName}")
    }
}

The function will attempt to pull the image up to 10 times. It has a back-off delay, so it first sleeps 30s and retries, then 60s, then 90s, etc. For temporary or intermittent network outages this eventually succeeds. For longer network outages, or if the problem is something else such as “image doesn’t exist”, the function eventually runs out of attempts and throws an error.

After that I just had to replace every instance of docker.image('myImageName:latest').pull() inside the Jenkinsfile with:

docker_image_pull(docker, 'myImageName:latest')

Hope you find this useful.


Setting up a personal, production-quality Kubernetes cluster with Kubespray

I’ve been setting up and tearing down Kubernetes clusters for testing various things for the past year, mostly using Vagrant/Virtualbox but also some VMware vSphere and OpenStack deployments.

I wanted to set up something a little more permanent in my home lab — a cluster where I could add and remove nodes, run nodes on multiple physical machines, and use different types of compute hardware.

Set up the virtual machines

To get started I used a desktop System76 Wild Dog Pro Linux box (4.5 GHz i7-7700K, 64GB DDR4) and my create-vm script to create six Ubuntu 18.04 “Bionic Beaver” VMs for the cluster:

for n in $(seq 1 6); do
    create-vm -n node$n \
        -i ./ubuntu-18.04-server-amd64.iso \
        -k ./ubuntu.ks \
        -r 4096 \
        -c 2 \
        -s 40
done

With these parameters each VM will have 4GB RAM, 2 VCPUs, and a 40GB hard drive.

Install and configure Kubespray

I cloned Kubespray into a directory and created an Ansible inventory file following the instructions from the README.

git clone git@github.com:kubernetes-sigs/kubespray.git
cd kubespray
pip install -r requirements.txt
rm -Rf inventory/mycluster/
cp -rfp inventory/sample inventory/mycluster
declare -a IPS=($(for n in $(seq 1 6); do get-vm-ip node$n; done))
CONFIG_FILE=inventory/mycluster/hosts.ini \
python3 contrib/inventory_builder/inventory.py ${IPS[@]}

The get-vm-ip script is in the same repo as the create-vm script, and both are described in my Use .iso and Kickstart files to automatically create Ubuntu VMs article.

The inventory.py script generates an Ansible hosts inventory file in inventory/mycluster/hosts.ini with all of your VM IP addresses.

I like to add one variable override to the bottom of hosts.ini which copies the kubectl credentials over to my host machine. That way I can run kubectl commands directly from my desktop. The extra lines to add to the bottom of hosts.ini are:

[all:vars]
kubectl_localhost=true

Install Kubernetes

To install Kubernetes on the VMs I run the Kubespray cluster.yaml playbook:

export ANSIBLE_REMOTE_USER=ansible
ansible-playbook -i inventory/mycluster/hosts.ini \
--become --become-user=root cluster.yml

Once the playbooks have finished, you should have a fully-operational Kubernetes cluster running on your desktop.

At this point you should be able to query the cluster from your desktop using kubectl. For example:

$ kubectl cluster-info
Kubernetes master is running at https://192.168.122.251:6443
coredns is running at https://192.168.122.251:6443/api/v1/namespaces/kube-system/services/coredns:dns/proxy
kubernetes-dashboard is running at https://192.168.122.251:6443/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
$ kubectl get nodes
NAME    STATUS   ROLES         AGE    VERSION
node1   Ready    master,node   3d6h   v1.13.0
node2   Ready    master,node   3d6h   v1.13.0
node3   Ready    node          3d6h   v1.13.0
node4   Ready    node          3d6h   v1.13.0
node5   Ready    node          3d6h   v1.13.0
node6   Ready    node          3d6h   v1.13.0
$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-67f89845f-6zbvx   1/1     Running   1          3d6h
kube-system   calico-node-jh7ng                         1/1     Running   2          3d6h
kube-system   calico-node-l9vfb                         1/1     Running   2          3d6h
kube-system   calico-node-mqxjx                         1/1     Running   2          3d6h
...

Set up the Kubernetes Dashboard

One of the first things I like to do is set up access to the Kubernetes dashboard. First I set up a service account for the admin user:

$ cat ~/Projects/k8s-cluster/dashboard-adminuser.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kube-system
$ kubectl apply -f ~/Projects/k8s-cluster/dashboard-adminuser.yaml

Next I get the bearer token for the user account:

$ kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep admin-user | awk '{print $1}')
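
The dashboard URL is in the kubectl cluster-info output from earlier; if you don’t want to scroll back, you can pull it out directly:

kubectl cluster-info | grep dashboard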

Finally I plug the dashboard URL that I got from kubectl cluster-info into my browser, select “Token” authentication, and cut and paste in the bearer token to log into the system.

Once logged in, an overview of my cluster pops up.

With a minimal amount of working compute infrastructure, it’s easy to set up your own production-quality Kubernetes cluster using Kubespray.

Hope you find this useful.

Mounting AWS EFS volumes inside Docker Containers

Amazon announced the development of the Amazon Elastic File System (AWS EFS) in 2015. EFS was designed to provide multiple EC2 instances with shared, low-latency access to a fully-managed file system. On June 28, 2016 Amazon announced that EFS is now available for production use in the US East (Northern Virginia), US West (Oregon), and Europe (Ireland) Regions.

Apcera’s NFS Service Gateway can be used to access AWS EFS storage volumes within containers. You can use EFS to provide persistent storage to your containers running on AWS-hosted clouds in regions where EFS is available.

Gathering information

Before you begin you will need to know:

  • The name of the AWS Region where your Apcera Platform is running
  • The name/ID of the AWS VPC where your Apcera Platform is running
  • The name/ID of the AWS security group for your Apcera Platform

Setting up an EFS volume

  1. Log into your AWS console.
  2. Select the name of the AWS Region where your Apcera Platform is running on the upper right side of the screen.
  3. Select Elastic File System.
  4. Click Create File System.
  5. Configure the file system access:
    • Select the name of the VPC.
    • The availability zone and subnet should be selected for you automatically.
    • If your VPC has more than one subnet (unusual) then select the subnet containing the Instance Managers that will be connecting to the EFS volume.
    • Leave IP address set to Automatic.
    • The first EFS volume you create will create a new security group. Use that security group for this and all future EFS volumes. Write down the name of the new EFS security group – we’ll configure it in the next few steps.
    • Click Next Step
  6. Configure optional settings:
    • Set the name of the EFS volume.
    • Choose the performance mode.
    • Click Next Step
  7. Review and create:
  8. If everything looks OK, click Create File System.
  9. You should see a “Success!” message and a new EFS volume with “Life Cycle State” = “Creating”.
  10. Write down the IP address of the EFS volume.

Update the EFS security group

  • Go back to the main console menu and select EC2.
  • Click Security Groups in the left hand nav menu.
  • Type the name of the new EFS security group into the search filter list.
  • On the bottom half of the screen delete the default inbound and outbound rules.
  • Add one inbound rule to allow all TCP traffic on port 2049 from the source “name/ID of the AWS security group for your Apcera Platform”
  • Add one outbound rule to allow all TCP traffic on port 2049 to the destination “name/ID of the AWS security group for your Apcera Platform”
  • This allows all VMs within your Apcera Platform security group to connect to your EFS volume on port 2049 (NFS).
  • No other traffic from any other source or to any other destination is allowed.
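
If you would rather script this than click through the console, the inbound rule can be added with the AWS CLI along these lines (both security group IDs below are placeholders for your EFS security group and your Apcera Platform security group); the outbound rule is the same command with authorize-security-group-egress:

# Allow inbound NFS (TCP 2049) into the EFS security group from the
# Apcera Platform security group. Both group IDs are placeholders.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0000000000efs0000 \
    --protocol tcp --port 2049 \
    --source-group sg-000000000apcera00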

Create an NFS Provider for the EFS volume

We’re going to create a single provider for the EFS volume. Each time you have a container or set of containers that need a persistent file system, just create a new service from the same provider. Each new service will carve out a new namespace on the EFS volume, keeping the files associated with that service separate from the files in all other services that use the same provider.

According to the EFS FAQ, when you create a file system, you create endpoints in your VPC called “mount targets.” Each mount target provides an IP address and a DNS name, and you use this IP address or DNS name in your mount command. Only resources that can access a mount target can access your file system. Since the Apcera Platform isn’t using Amazon DNS services internally, we’ll use the IP address to connect to the EFS volume.

To create the provider, you need to construct a URL describing the volume. In this case, we’ll use the internal IP address of the EFS volume as the hostname and / as the exported volume name. All EFS volumes use the NFS v4.1 protocol. If the IP address of the EFS volume is 10.0.0.112 we’d construct a provider using:

apc provider register awsefs --type nfs \
    --url "nfs://10.0.0.112/" \
    --description 'Amazon EFS' \
    --batch \
    -- --version 4.1

Create a service from the provider:

apc service create efs-service-1 \
    --provider awsefs \
    --description 'Amazon EFS Service' \
    --batch

Create a capsule, bind the service to the capsule, and connect to the capsule:

apc capsule create efs-capsule1 --image linux -ae --batch
apc service bind efs-service-1 --job efs-capsule1 \
    --batch -- --mountpath /an/unlimited/supply
apc capsule connect efs-capsule1

Once connected, type df -k to see the mounted file system.

You can bind this service to any container that needs a shared, persistent file system. Each time you need a new shared, persistent file system for a container or group of containers just create a new service using the same provider and bind the service to your job or jobs.

Persistence for Docker

Now that we have a provider that can carve out EFS storage for containers, let’s try spinning up some Docker images.

On the Apcera Platform, if the specification for a Docker image (Dockerfile) specifies that the app requires persisted volumes, you must do one of the following when creating the job:

  • Include the --provider flag when you create or run the Docker job. You must include this flag if you include the --volume flag when creating or running the Docker job.
  • Include the --ignore-volumes flag when you create or run the Docker job.

Here is an example of running NGINX inside a Docker container on the Apcera platform, where the content for the site is stored on an EFS volume:

I’m using the Apcera “apc” command-line tool to build the container, pulling the nginx image directly off hub.docker.com, telling it to use the awsefs EFS volume provider I created earlier for persistence, and to mount the EFS volume at the mount point “/usr/share/nginx/html”.
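
Roughly, that command looks something like the following sketch. The job name nginx-efs is made up, and the exact apc syntax may differ in your version of the platform, so check the apc help for the exact flags; the --provider and --volume flags are the ones described above:

# Sketch: run the nginx image from Docker Hub as an Apcera job, mounting
# an EFS-backed volume (from the awsefs provider) at /usr/share/nginx/html.
# The job name "nginx-efs" is arbitrary.
apc docker run nginx-efs --image nginx \
    --provider awsefs \
    --volume /usr/share/nginx/html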

Now connect to the container.

/proc/mounts contains a list of all of the container’s mount points. I can verify that the container does indeed have an EFS volume by grepping /proc/mounts for the mount point:
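
The check itself is a one-line grep inside the container:

# Look for the EFS-backed NFS mount at the container's mount point.
grep /usr/share/nginx/html /proc/mounts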

Grepping for “/usr/share/nginx/html” shows the IP address 10.0.0.112, which is the IP of the EFS volume; the long directory name after it is the unique namespace for the service; the mount point is “/usr/share/nginx/html”; and the mount type is “nfs4”.

There is no content in the directory, so I add some by echoing some HTML code to an index.html file. My container will proclaim to the world “NGINX in a Docker container on Apcera with content stored on EFS” in an H3 typeface!
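
Inside the container that boils down to a single echo:

# Write a minimal index.html to the EFS-backed content directory.
echo '<h3>NGINX in a Docker container on Apcera with content stored on EFS</h3>' \
    > /usr/share/nginx/html/index.html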

Now that I have some content I need to add a route to the content. Right now the NGINX container is running, and listening on ports 80 and 443, but it’s completely isolated from the outside world — no one can connect to those ports unless there’s a route (a URL) set up.

My cluster is running on the domain earlruby.apcera-platform.io, so I add a route like so:

I have successfully added the http route http://nginx.earlruby.apcera-platform.io/ to my NGINX container. This is a real public DNS entry. To verify that it works I point my browser at the route I just added:

Success!

Such an amazing app is bound to go viral, and a single NGINX container may not be able to keep up with the load. I want to ensure that my app can keep up and remain highly-available, and that it keeps running even if one or more VMs in my cluster get killed off, so I add more NGINX containers:

Now I’ve got 20 containers running my NGINX app, all serving up the same content, running on multiple VMs across my cluster, all load-balanced under the single URL http://nginx.earlruby.apcera-platform.io/. If any container gets killed off, the Apcera platform will spin up a new one. If any VM in the cluster dies, any containers running on it will automatically be migrated to new hosts. If I want to scale up the app to 100 or 1000 containers, or back down to 1, it’s a one-line command to make the change.

In terms of resources, I’m using slightly less than 45 MiB to run those 20 containers. That’s not a typo — 45MiB! Containers are much more efficient users of RAM than VMs.

I hope you find this useful.

This article originally appeared as an Apcera blog post on July 21, 2016.

Policy-based Cloud Storage

This is a talk I gave last week at the SF Microservices Meetup titled Policy-based Cloud Storage, Persisting Data in a Multi-Site, Multi-Cloud World. In it I cover Apcera’s approach to storage for containers and how to use policy to manage very large scale application deployments.