post

Getting NVIDIA NGC containers to work with VMware PVRDMA networks

NVIDIA publishes a set of NVIDIA GPU-accelerated Containers (NGC) with applications and frameworks for machine learning, deep learning, and high-performance computing.

VMware developed a platform that allows people and companies to create their own private clouds. For customers with high-speed, low-latency networking requirements they offer a couple of different networking options, one of which is PVRDMA (ParaVirtualized Remote Direct Memory Access) networking.

Full disclosure: I used to work for a startup called Bitfusion, and that startup was bought by VMware, so I now work for VMware. At Bitfusion we developed a technology for accessing hardware accelerators, such as NVIDIA GPUs, remotely across networks using TCP/IP, Infiniband, and PVRDMA. I still work on the Bitfusion product at VMware, and spend a lot of my time getting AI and ML workloads to work across networks on virtualized GPUs.

Some NVIDIA NGC containers ship with Mellanox OFED installed. OFED is a set of drivers for high speed network cards to enable RDMA/Infiniband networking. The drivers installed in these containers do not include support for PVRDMA.

NVIDIA containers are based on Ubuntu base images. Ubuntu ships its own RDMA drivers in a package called rdma-core. The Ubuntu rdma-core package contains the drivers needed to work with VMware PVRDMA networking.

The Ubuntu rdma-core package contains the drivers needed to work with VMware PVRDMA networking.

Ideally you should only install the RDMA network package that you need, either OFED or rdma-core, not both. In fact, if you try running both you will have problems. Therefore, if you’re going to use NGC containers on a PVRDMA network you should first remove the OFED packages and then add the rdma-core packages.

Luckily you can start an NGC container and see if OFED is installed or not and see what version is installed. If I start the NGC container for Tensor RT:

docker run -it --rm -u root nvcr.io/nvidia/tensorrt:19.09-py3

I can see that it’s based on Ubuntu 18.04 “bionic”:

root@2e70d41e1187:/workspace# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

If I look inside /opt/mellanox/DEBS/ I can see if any OFED .deb files are installed:

root@2e70d41e1187:/workspace# ls -al /opt/mellanox/DEBS/
total 64
drwxrwxr-x 15 root root 4096 Aug 27  2019 .
drwxr-xr-x  3 root root 4096 Sep 13  2019 ..
drwxrwxr-x  2 root root 4096 Aug 27  2019 3.4-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 3.4-2.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.0-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.0-2.0.0
lrwxrwxrwx  1 root root    9 Aug 27  2019 4.0-2.0.2 -> 4.0-2.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.1-1.0.2
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.2-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.2-1.2.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.3-1.0.1
lrwxrwxrwx  1 root root    9 Aug 27  2019 4.3-3.0.2 -> 4.3-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.4-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.4-2.0.7
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.5-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.6-1.0.1
lrwxrwxrwx  1 root root    9 Aug 27  2019 5.0-0 -> 5.0-1.1.8
drwxrwxr-x  2 root root 4096 Aug 27  2019 5.0-1.1.8
-rwxrwxr-x  1 root root  546 Aug 27  2019 add_mofed_version.sh

In this case there are Mellanox OFED packages installed. If I look inside these directories (ls -1 /opt/mellanox/DEBS/*) I can see that the packages installed from OFED are:

  • ibverbs-utils
  • libibverbs-dev
  • libibverbs1
  • libmlx5-1

These are OFED versions of packages installed in this specific container. A different NGC container might contain these OFED packages, or different OFED packages, or no OFED packages at all.

There are versions of these same packages in Ubuntu repos, and the Ubuntu versions conflict with the OFED versions. To use the Ubuntu versions, first remove the OFED packages:

root@2e70d41e1187:/workspace# apt-get purge -y ibverbs-utils libibverbs-dev libibverbs1 libmlx5-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
  ibverbs-utils* libibverbs-dev* libibverbs1* libmlx5-1*
0 upgraded, 0 newly installed, 4 to remove and 23 not upgraded.
After this operation, 1523 kB disk space will be freed.
(Reading database ... 18622 files and directories currently installed.)
Removing ibverbs-utils (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libibverbs-dev (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...
Removing libibverbs1 (41mlnx1-OFED.4.4.1.0.0.44100) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
(Reading database ... 18449 files and directories currently installed.)
Purging configuration files for libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...

You can see in the output above that the packages that I removed have the name “OFED” in them, indicating that they came from OFED, not Ubuntu. If I reinstall using rdma-core and the other packages I need:

apt-get update && apt-get install -y --reinstall \
    -t bionic rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3

This installs everything from the Ubuntu repositories for the “bionic” version, which is the version of Ubuntu that this NGC container is based on. (Which we determined back in step 1.)

The -t flag is necessary because I’ve found that some NGC containers mix code from the repositories of different versions of Ubuntu, and we only want to install packages from the base Ubuntu version, which is “bionic” in this particular case.

At this point the container is ready to use PVRDMA connections.

However, I also want to connect to a remote Bitfusion server across a PVRDMA network and use a pool of GPUs for my TensorRT work, so I also install the Bitfusion client:

wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

To create a new container with all of these changes I just have to whip up a small Dockerfile:

# Base this container on the NGC container you want to use
FROM nvcr.io/nvidia/tensorrt:19.09-py3

# Remove the OFED packages that are installed,
# determined by running “ls -1 /opt/mellanox/DEBS/*”
RUN apt-get purge -y ibverbs-utils libibverbs-dev \
    libibverbs1 libmlx5-1

# Install the Ubuntu RDMA packages using the
# UBUNTU_CODENAME from /etc/os-release
# as the -t argument.
RUN apt-get update && apt-get install -y --reinstall \
    -t bionic \
    rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3

# Install the Bitfusion 3.0.0 client software for Ubuntu 18.04
RUN wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

RUN apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

To build an image using this Dockerfile:

mkdir -p ~/build
docker build -t tensorrt:19.09-py3-pvrdma -f Dockerfile ~/build

Run this image:

docker run -it --rm -u root --network host \
    tensorrt:19.09-py3-pvrdma

In this instance I’m passing the host’s network through to the container. Assuming that the host already has PVRDMA networking set up correctly, I can use that PVRDMA network inside the NGC container. With the Bitfusion client in the container I can run TensorRT and access GPUs from a remote pool of GPUs across a PVRDMA network.

Hope you find this useful.

Share Button
post

Upgrading vCenter 7 via the command line

I have vCenter 7.0.0.10700 installed and I want to update to 7.0.1.00200. When I run Update Planner > Interoperability it reports that all of my ESXi hosts are running ESXi 7.0.1. If I run the pre-update checks I get “No issues found”. When I go to the appliance to do the upgrade, both “Stage Only” and “Stage and Install” are greyed-out and unselectable.

vCenter 7 Appliance Available Updates screen

I tried a dozen different tricks, including ssh-ing into the appliance as root and editing the /etc/applmgmt/appliance/software_update_state.conf file, but nothing could enable the “Stage Only” and “Stage and Install” buttons.

Use the command line

I finally decided to try upgrading via the command line. I have backups going back 30 days. I even double-checked and yes, my NFS server has files in the backup directory for each of the past 30 days and they have data in them. There’s probably even a way to restore one of those backups if something goes horribly wrong. Onwards!

I was already logged into the vCenter appliance shell as root. The next thing I needed to do was to figure out where the command line tools were hidden. I found them in /usr/lib/applmgmt/support/scripts.

Disclaimer: I work at VMware, but I have no idea if the following is an “acceptable practice” or not. If your production vCenter is broken and you have a support contract, call support. If you’re messing around on a home or test system and you don’t care how badly you screw it up, feel free to try the command line tools.

root@vcenter [ ~ ]# cd /usr/lib/applmgmt/support/scripts
root@vcenter [ /usr/lib/applmgmt/support/scripts ]# ls -al
total 108
drwxr-xr-x 4 root root  4096 Aug 30 18:18 .
drwxr-xr-x 4 root root  4096 Aug 30 18:18 ..
-r-xr-xr-x 1 root root   205 Aug 15 07:16 autogrow.sh
-r-xr-xr-x 1 root root   633 Aug 15 07:16 manifest-verification
-r-xr-xr-x 1 root root   286 Aug 15 07:16 mapping.sh
-r-xr-xr-x 1 root root  2056 Aug 15 07:16 pgtop.py
-r-xr-xr-x 1 root root  3396 Aug 15 07:16 port-accessible.py
drwxr-xr-x 2 root root  4096 Aug 30 18:18 postinstallscripts
-r-xr-xr-x 1 root root  5207 Aug 15 07:16 prestart-applmgmt.sh
-r-xr-xr-x 1 root root  4171 Aug 15 07:16 resize-root.py
-r-xr-xr-x 1 root root   251 Aug 15 07:16 setup-env.sh
-r-xr-xr-x 1 root root  4001 Aug 15 07:16 showlog.py
-r-xr-xr-x 1 root root  3910 Aug 15 07:16 shutdown.py
-r-xr-xr-x 1 root root 35773 Aug 15 07:16 software-packages.py
-r-xr-xr-x 1 root root  8085 Aug 15 07:16 support-bundle.py
drwxr-xr-x 2 root root  4096 Aug 30 18:18 tests

These are the Python scripts that are linked to the Command shell. I’m actually in the root shell. I can run these directly from the root shell, or exit back to the Command shell and use them in the “official” way. In case I need to pull in support let’s do this the official way.

The software-packages.py script is what does the upgrade. Let’s exit back to the Command shell and see what it says it supports.

root@vcenter [ /usr/lib/applmgmt/support/scripts ]# exit
Command> software-packages
usage: software-packages [-h] {stage,unstage,validate,install,list} ...

optional arguments:
  -h, --help            show this help message and exit

sub-commands:
  {stage,unstage,validate,install,list}
    stage               Stage software update packages
    unstage             Purge staged software update packages
    validate            Validate software update packages
    install             Install software update packages
    list                List details of software update packages

Stage the packages for the update

Since the appliance wasn’t letting me upgrade, I thought I’d first check to see if I already have upgrades staged.

Command> software-packages list --staged
 [2021-01-22T21:45:41.022] : Packages not staged

OK. Nothing staged. How do I stage packages?

Command> software-packages stage --help
usage: software-packages stage [-h] [--url [URL]] [--iso] [--acceptEulas] [--thirdParty]

optional arguments:
  -h, --help     show this help message and exit
  --url [URL]    Download software update package from URL. If no url is specified, https://vapp-updates.vmware.com/vai-
                 catalog/valm/vmw/8dc0de9a-feedl-1337-be0a-6ddeadbeefa3/6.7.0.42000.latest/ is used.
  --iso          Load software update packages from CD/DVD drive attached to the appliance
  --acceptEulas  accept all Eulas
  --thirdParty   Stage third party packages.--thirdParty should only be usedwith --url.

Sounds clear enough. I’ll try that:

Command> software-packages stage --url --acceptEulas
 [2021-01-22T21:46:28.022] : Latest updates already installed on VCSA, Nothing to stage

Well that’s not correct. There’s definitely an update available. Re-reading help again I notice that the default URL looks something like:

https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8dc0de9a-feedl-1337-be0a-6ddeadbeefa3/6.7.0.42000.latest/

I’ve obfuscated the actual URL, but that’s a vCenter 6.7.0 URL, I’m using 7.0.0, and I want 7.0.1.

I go back to the appliance web UI and click the Update > Settings button.

vCenter 7 Appliance Update screen

Settings shows a different URL for 7.0.1, so I copy and paste that into the command line:

Command> software-packages stage --acceptEulas --url https://vapp-updates.vmware.com/vai-catalog/valm/vmw/......
 [2021-01-22T21:48:28.022] : Target VCSA version = 7.0.1.00200
 [2021-01-22 21:48:28,781] : Running requirements script.....

Trust but verify

A little while later everything was staged. I decided to validate everything.

Command> software-packages validate
 [2021-01-22T21:50:11.022] : For the first instance of the identity domain, this is the password given to the Administrator account.  Otherwise, this is the password of the Administrator account of the replication partner.
Enter Single Sign-On administrator password:

 [2021-01-22T21:50:22.022] : Validating software update payload
 [2021-01-22 21:50:22,327] : Running validate script.....
 [2021-01-22T21:50:26.022] : Validation successful
 [2021-01-22T21:50:26.022] : Validation process completed successfully

Then I check to see what’s staged:

Command> software-packages list --staged
 [2021-01-22T21:50:45.022] :
        category: Bugfix
        kb: https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vsphere-vcenter-server-70u1c-release-notes.html
        leaf_services: ['vmware-pod', 'vsphere-ui', 'wcp']
        vendor: VMware, Inc.
        name: VC-7.0U1c
        size in MB: 5107
        tags: []
        version_supported: []
        productname: VMware vCenter Server
        releasedate: December 17, 2020
        executeurl: https://my.vmware.com/group/vmware/get-download?downloadGroup=VC70U1C
        version: 7.0.1.00200
        updateversion: True
        allowedSourceVersions: [7.0.0.0,]
        buildnumber: 17327517
        rebootrequired: False
        summary: {'id': 'patch.summary', 'translatable': 'In-place upgrade for vCenter appliances.', 'localized': 'In-place upgrade for vCenter appliances.'}
        type: Update
        severity: Critical
        TPP_ISO: False
        url: https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8dc0de9a-feedl-1337-be0a-6ddeadbeefa3/7.0.0.10700.latest/
        thirdPartyAvailable: False
        nonThirdPartyAvailable: True
        thirdPartyInstallation: False
        timeToInstall: 0
        requiredDiskSpace: {'/storage/core': 30.353511543273928, '/storage/seat': 32.21015625}
        eulaAcceptTime: 2021-01-22 21:48:37 UTC

Well, that shows:

version: 7.0.1.00200

Which is the version I’ve been trying to upgrade to, so that looks good.

Did I mention that I have backup copies of vCenter going back 30 days? Well I do. If this goes really sideways I’m going to have to restore one of them.

Let’s do the update!

Command> software-packages install --staged
 [2021-01-22T21:51:23.022] : For the first instance of the identity domain, this is the password given to the Administrator account.  Otherwise, this is the password of the Administrator account of the replication partner.
Enter Single Sign-On administrator password:

 [2021-01-22T21:51:43.022] : Validating software update payload
 [2021-01-22 21:51:43,716] : Running validate script.....
 [2021-01-22T21:51:47.022] : Validation successful
 [2021-01-22 21:51:47,730] : Copying software packages 251/251
 [2021-01-22 21:55:37,642] : Running system-prepare script.....
 [2021-01-22 21:55:42,661] : Running test transaction ....
 [2021-01-22 21:55:44,678] : Running prepatch script...
....
 [2021-01-22 21:58:27,896] : Upgrading software packages ....
 [2021-01-22T22:02:10.022] : Setting appliance version to 7.0.1.00200 build 17327517
 [2021-01-22 22:02:10,242] : Running patch script.....
 [2021-01-22 22:11:34,245] : Starting all services ....
 [2021-01-22T22:11:35.022] : Services started.
 [2021-01-22T22:11:35.022] : Installation process completed successfully

That was it. The actual update took about 20 minutes, and although the UI said no reboot was necessary vCenter did reboot during the update. When it was done vCenter was running version 7.0.1.00200.

The vCenter appliance Update “Stage Only” and “Stage and Install” buttons are still greyed-out and unselectable, but right now there are no updates available so that’s how they should be. I’ll have to wait for the next update to see if they’re working again. If the buttons are still broken, at least now I know how to use the command line to install an update.

Hope you find this useful.

Share Button
post

Updating the vCenter appliance root password

If you’re like me, you rarely ssh into your vCenter appliance as “root”. However, the time comes when you need to update vCenter, you run the “Pre-Update Checks” — and because you never log into the appliance — you get the message that your root password needs to be updated before you can install the update.

So… log into the vCenter Service Management Console (https://your-vcenter:5480), click Access and then Edit. Make sure that SSH Login, DCLI, Console CLI, and BASH access are all enabled. Set the BASH timeout to 15 minutes so it gets disabled automatically when you’re done.

Once you’ve done that, ssh to the appliance.

$ ssh root@vcenter.labs.earlruby.org

VMware vCenter Server 7.0.0.10700

Type: vCenter Server with an embedded Platform Services Controller

Received disconnect from 192.168.200.11 port 22:2: Too many authentication failures
Disconnected from 192.168.200.11 port 22

Did you get a “Received disconnect … Too many authentication failures” message? Don’t worry, no one is hacking into your vCenter, it’s just that you have more than one ssh key on your keyring and for some reason someone at VMware thought that it would be a great idea to set the vCenter ssh setting MaxAuthTries = 2. Your first ssh key counts as one try, your second ssh key counts as attempt number 2, and… you’re done. vCenter won’t let you log in.

To bypass public key authentication checks entirely use the -o PubkeyAuthentication=no parameter for ssh:

$ ssh -o PubkeyAuthentication=no root@vcenter.labs.earlruby.org

VMware vCenter Server 7.0.0.10700

Type: vCenter Server with an embedded Platform Services Controller

root@vcenter.labs.earlruby.org's password:
Connected to service

    * List APIs: "help api list"
    * List Plugins: "help pi list"
    * Launch BASH: "shell"

Command>

Now get to the bash shell by typing shell, then passwd to set the new password, and you can update the root password:

Command> shell
Shell access is granted to root
root@vcenter [ ~ ]# passwd
New password:
Retype new password:
passwd: password updated successfully
root@vcenter [ ~ ]# exit
Command> exit
Connection to vcenter.labs.earlruby.org closed.

Before you log out, run the Pre-Update Check again to verify that vCenter sees that the password has been updated. This time you should get the message “No issues found. Pre-update checks have passed.”

Hope you find this useful.

Share Button
post

Generate a crypted password for Ansible

The Ansible user: command allows you to add a user to a Linux system with a password. The password must be passed to Ansible in a hashed password format using one of the hash formats supported by /etc/shadow.

Some Ansible docs suggest storing your passwords in plain text and using the Ansible SHA512 filter to hash the plaintext passwords before passing them to the user module. This is a bad practice for a number of reasons.

Storing your passwords in plain text is a bad idea

  • Storing your passwords in plain text is a bad idea, since anyone who can read your Ansible playbook now knows your password.
  • The play is not idempotent, since the SHA512 filter will re-hash the password every time you run the play, resetting the password each time the play is run.
  • Attempting to make the play idempotent, by using update_password: on_create, means that you can no longer update the password using Ansible. This might be OK if you’re just updating one machine. It’s a giant pain in the ass if you need to update the password on many machines.

A better way is to hash the password once using openssl and store the hashed version of the password in your Ansible playbooks:

- name: Set the user's password
  user:
    name: earl
    password: "$6$wLZ77bHhLVJsHaMz$WqJhNW2VefjhnupK0FBj5LDPaONaAMRoiaWle4rU5DkXz7hxhl3Gxcwshuy.KQWRFt6YPWXNbdKq9B/Rk9q7A."

To generate the hashed password use the openssl passwd command on any Linux host:

openssl passwd -6 -stdin

This opens an interactive shell to openssl. Just enter the password that you want to use, hit enter, and openssl will respond with the hashed version. Copy and paste the hashed version of the password into Ansible, and the next time you run Ansible on a host the user’s password will be updated.

Type Ctrl-D to exit the interactive openssl shell.

Since you used the interactive shell the plaintext password that you entered is not saved into the Linux host’s history file and is not visible to anyone running ps.

The crypted password is encrypted with a SHA512 one-way hash and a random 16 character salt, so you can check the playbook into a Git repository without revealing your password.

Hope you find this useful.

Share Button
post

Mouse button Copy & Paste on Ubuntu 20.04

Using the left mouse button to select and copy text in terminals and the middle mouse button to paste has been a feature of X-Windows, and the various window managers built on top of X-Windows, since the early 1990s. With the release of Ubuntu 20.04 and Gnome 3.36 Canonical has removed this convention, forcing a more awkward and slower select, right click, select Copy from a menu, point, right click, select Paste from menu to do the same thing.

If you want to restore select-to-copy, middle button to paste functionality to Ubuntu 20.04 just follow these steps.

Restore select-to-copy functionality

Edit the file .Xresources in your home directory.

Add the line:

xterm*selectToClipboard: true

… to the file, then logout of your desktop and log back in, or reboot.

Once you’ve done that any text that you select in the Terminal program with your left mouse button will be copied to your clipboard. Left click a word and the word is copied to the clipboard. Left click and drag to select and copy an entire line, an entire paragraph, or more.

Restore middle-button paste functionality

Install gnome-tweaks:

sudo apt-get install gnome-tweaks

Click “Activities” in the upper right and search for “tweaks”, click the “Tweaks” icon.

Select “Keyboard & Mouse” and turn “Middle Click Paste” to “on”.

Once you’ve done that, clicking the middle mouse button will paste text from your clipboard back into the terminal.

Hope you find this useful.

Share Button