Calculating the value for 64bitMMIOSizeGB

Posted on 2022-02-05 by Earl C. Ruby III

When adding a GPU to a vSphere VM using PCI passthrough there are a couple of additional settings that you need to make or your VM won’t boot.

When creating the VM, go to Actions > Edit > VM Options > Boot Options > Firmware and select “EFI”. You need to do this before you install the operating system on the VM. If you don’t do this the GPUs won’t work and the VM won’t boot.

To add a GPU, in vCenter go to the VM, select Actions > Edit > Add New Device. Any GPUs set up as PCI passthrough devices should appear in a pick list. Add one or more GPUs to your VM.

Note that after adding one device, when you add additional GPUs the first GPU you selected still appears in the pick list. If you add the same GPU more than once your VM will not boot. If you add a GPU that’s being used by another running VM your VM will not boot. Pay attention to the PCI bus addresses displayed and make sure that the GPUs you pick are unique and not in use on another VM.

Finally you have to set up memory-mapped I/O (MMIO) to map system memory to the GPU’s framebuffer memory so that the CPU can pass data to the GPU. In vCenter go to the VM, select Actions > Edit > VM Options > Advanced > Edit configuration.

Once you’re on the Configuration parameters screen, add two more parameters:

pciPassthru.use64bitMMIO = TRUE
pciPassthru.64bitMMIOSizeGB = ????

The 64bitMMIOSizeGB value is calculated from the total GB of framebuffer memory on all GPUs attached to the VM. If the total GPU framebuffer memory is exactly a power of 2, setting pciPassthru.64bitMMIOSizeGB to the next higher power of 2 works.

If the total GPU framebuffer memory falls between two powers-of-2, round up to the next power of 2, then round up again, to get a working setting.

Powers of 2 are 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 …

For example, two NVIDIA A100 cards with 40GB each = 80GB (in between 64GB and 128GB), so round up to the next power of 2 (128GB), then round up again to the next power of 2 after that (256GB) to get the correct setting. If you set it too low the VM won’t boot, but it won’t give you an error message telling you what the issue is either.

Here are some configurations that I’ve tested and verified:

2 x 16GB NVIDIA V100 = 32GB, 32 is a power of 2, so round up to the next power of 2 which is 64, set pciPassthru.64bitMMIOSizeGB = 64 to boot.
2 x 24GB NVIDIA P40 = 48GB, which is in-between 32 and 64, round up to 64 and again to 128, requires pciPassthru.64bitMMIOSizeGB = 128 to boot.
8 x 16GB NVIDIA V100 = 128GB, 128 is a power of 2, so round up to the next power of 2 which is 256, set pciPassthru.64bitMMIOSizeGB = 256 to boot.
10 x 16GB NVIDIA V100 = 160GB, which is in-between 128 and 256, round up to 256 and again to 512, set pciPassthru.64bitMMIOSizeGB = 512 to boot.
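
The rounding rule above can be written as a small calculation. This is only a sketch in Python of the rule as I've described it, checked against the configurations listed above; the function name mmio_size_gb is made up for this example and isn't part of any VMware tool.

```python
def mmio_size_gb(gpu_framebuffers_gb):
    """Return a pciPassthru.64bitMMIOSizeGB value for a list of GPU framebuffer sizes (in GB)."""
    total = sum(gpu_framebuffers_gb)
    if total <= 0:
        raise ValueError("need at least one GPU with framebuffer memory")

    # Find the smallest power of 2 that is >= the total framebuffer memory.
    power = 1
    while power < total:
        power *= 2

    # If the total is already a power of 2, go to the next power of 2.
    # If it falls between two powers of 2, round up, then round up again.
    # Both cases come out to double the smallest power of 2 >= total.
    return power * 2


if __name__ == "__main__":
    for gpus in ([16, 16], [24, 24], [16] * 8, [16] * 10, [40, 40]):
        size = mmio_size_gb(gpus)
        print(f"{len(gpus)} x {gpus[0]}GB -> pciPassthru.64bitMMIOSizeGB = {size}")
```

I've only verified the configurations listed above on real hardware, so treat the result for any other combination as a starting point rather than a guarantee.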

Hope you find this useful.

Updating ESXi root passwords and authorized ssh keys with Ansible

Posted on 2021-10-06 by Earl C. Ruby III

I manage a number of vCenter instances and a lot of ESXi hosts. Some of the hosts are production, some for test and development. Sometimes an ESXi host needs to be used by a different group or temporarily moved to a new cluster and then back again afterwards.

To automate the configuration of these systems and the VMs running on them I use Ansible. For a freshly-imaged, new installation of ESXi one of the first things I do is run an Ansible playbook that sets up the ESXi host. The first thing the playbook does is install the ssh keys of the people who need to log in as root, then it updates the root password.

I have ssh public keys for every user that needs root access. A short bash script combines those keys and my Ansible management public key into authorized_keys files for the ESXi hosts in each vCenter instance. In my Ansible group_vars/ directory is a file for each group of ESXi hosts, so all of the ESXi hosts in a group get the same root password and ssh keys. This also makes it easy to change root passwords and add and remove ssh keys of users as they are added to or leave different groups.
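
My own script for combining the keys is bash, but the idea is simple enough to sketch in Python. This is a rough illustration, not the script I actually use: the function names and the directory layout (one directory of *.pub files per group) are assumptions made up for the example.

```python
from pathlib import Path


def combine_public_keys(key_texts):
    """Merge the contents of several .pub files into one authorized_keys body."""
    seen = set()
    lines = []
    for text in key_texts:
        for line in text.splitlines():
            line = line.strip()
            # Skip blank lines, comments, and keys that were already added.
            if not line or line.startswith("#") or line in seen:
                continue
            seen.add(line)
            lines.append(line)
    return "\n".join(lines) + "\n"


def write_authorized_keys(pubkey_dir, output_file):
    """Combine every *.pub file in pubkey_dir (hypothetical layout) into output_file."""
    texts = [p.read_text() for p in sorted(Path(pubkey_dir).glob("*.pub"))]
    Path(output_file).write_text(combine_public_keys(texts))
```

With a layout like that, something along the lines of write_authorized_keys("keys/ops", "authkeys-ops") would produce a file that a group_vars entry can then point at.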

Here’s a portion of a group_vars/esxi_hosts_cicd/credentials.yml file for a production CICD cluster:

# ESXI Hosts (only Ops can ssh in)
esxi_root_authorized_keys_file: authkeys-ops

esxi_username: 'root'
esxi_password: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          34633832366431383630653735663739636466316262
          39363165663566323864373930386239380085373464
          32383863366463653365383533646437656664376365
          31623564336165626162616263613166643462356462
          34633832366431383630653735663739636466316262
          39363165663566323864373930386239380085373464
          32383863366463653365383533646437656664376365
          31623564336165626162616263613166643462356462
          3061

The password is encrypted using Ansible Vault.

In my main.yml file I call the esxi_host role for all of the hosts in the esxi_hosts inventory group. Since I use a different user to manage non-ESXi hosts, the play that calls the role tells Ansible to use the root user only when logging into ESXi hosts.

- name: Setup esxi_hosts
  gather_facts: False
  user: root
  hosts: esxi_hosts
  roles:
    - esxi_host

The esxi_host role has an esxi_host/tasks/main.yml tasks file. The two tasks that update the authorized_keys file and root password look like this:

- name: Set the authorized ssh keys for the root user
  copy:
    src: "{{ esxi_root_authorized_keys_file }}"
    dest: /etc/ssh/keys-root/authorized_keys
    owner: root
    group: root
    mode: '0600'

- name: Set the root password for ESXI Hosts
  shell: "echo '{{ esxi_password }}' | passwd -s"
  no_log: True

The first time I run this the password is set to some other value, so I start Ansible with:

ansible-playbook main.yml \
    --vault-id ~/path/to/vault/private/key/file \
    -i inventory/ \
    --limit [comma-separated list of new esxi hosts] \
    --ask-pass \
    --ask-become-pass

This will prompt me for the current root ssh password. Once I enter that it logs into each ESXi host, installs the new authorized_keys file, uses the vault private key to decrypt the password, then updates the root password.

After I’ve done this once, since the Ansible ssh key is also part of the authorized_keys file, subsequent Ansible updates just use the ssh key to log in, and I don’t have to use the --ask-pass or --ask-become-pass parameters.

This is also handy when switching a host from one cluster to another. As long as the ssh keys are installed I no longer need the current root password to update the root password.

Hope you find this useful.

Setting up a 100GbE PVRDMA Network on vCenter 7

Posted on 2021-06-20 by Earl C. Ruby III

After writing my last article on Getting NVIDIA NGC containers to work with VMware PVRDMA networks I had a couple of people ask me “How do I set up PVRDMA networking on vCenter?” These are the steps that I took to set up PVRDMA networking in my lab.

RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. It works by encapsulating an Infiniband (IB) transport packet and sending it over Ethernet. If you’re working with network applications that require high bandwidth and low latency, RDMA will give you lower latency, higher bandwidth, and a lower CPU load than an API such as Berkeley sockets.

Full disclosure: I used to work for a startup called Bitfusion, and that startup was bought by VMware, so I now work for VMware. At Bitfusion we developed a technology for accessing hardware accelerators, such as NVIDIA GPUs, remotely across networks using TCP/IP, Infiniband, and PVRDMA. I still work on the Bitfusion product at VMware, and spend a lot of my time getting AI and ML workloads to work across networks on virtualized GPUs.

In my lab I’m using Mellanox ConnectX-5 and ConnectX-6 cards on hosts that are running ESXi 7.0.2 and vCenter 7.0.2. The cards are connected to a Mellanox Onyx MSN2700 100GbE switch.

Since I’m working with Ubuntu 18.04 and 20.04 virtual machines (VMs) in a vCenter environment, I have a couple of options for high-speed networking:

I can use PCI passthrough to pass the PCI network card directly through to the VM and use the network card’s native drivers on the VM to set up a networking stack. However this means that my network card is only available to a single VM on the host, and can’t be shared between VMs. It also breaks vMotion (the ability to live-migrate the VM to another host) since the VM is tied to a specific piece of hardware on a specific host. I’ve set this up in my lab but stopped doing this because of the lack of flexibility and because we couldn’t identify any performance difference compared to SR-IOV networking.
I can use SR-IOV and Network Virtual Functions (NVFs) to make the single card appear as if it’s multiple network cards with multiple PCI addresses, pass those through to the VM, and use the network card’s native drivers on the VM to set up a networking stack. I’ve set this up in my lab as well. I can share a single card between multiple VMs and the performance is similar to PCI passthrough. The disadvantage is that setting up SR-IOV and configuring the NVFs is specific to a card’s model and manufacturer, so what works in my lab might not work in someone else’s environment.
I can set up PVRDMA networking and use the PVRDMA driver that comes with Ubuntu. This is what I’m going to show how to do in this article.

Set up your physical switch

First, make sure that your switch is set up correctly. On my Mellanox Onyx MSN2700 100GbE switch that means:

Enable the ports you’re connecting to.
Set the speed of each port to 100G.
Set auto-negotiation for each link.
MTU: 9000
Flowcontrol Mode: Global
LAG/MLAG: No
LAG Mode: On

Set up your virtual switch

vCenter supports Paravirtual RDMA (PVRDMA) networking using Distributed Virtual Switches (DVS). This means you’re setting up a virtual switch in vCenter and you’ll connect your VMs to this virtual switch.

In vCenter navigate to Hosts and Clusters, then click the DataCenter icon (looks like a sphere or globe with a line under it). Find the cluster you want to add the virtual switch to, right click on the cluster and select Distributed Switch > New Distributed Switch.

Name: “rdma-dvs”
Version: 7.0.2 – ESXi 7.0.2 and later
Number of uplinks: 4
Network I/O control: Disabled
Default port group: Create
Port Group Name: “VM 100GbE Network”
VLAN Type: VLAN (If you are using a VLAN)
VLAN ID: (the VLAN ID associated with the subnet you’re using for this network)

Figure out which NIC is the right NIC

Go to Hosts and Clusters
Select the host
Click the Configure tab, then Networking > Physical adapters
Note which NIC is the 100GbE NIC for each host

Add Hosts to the Distributed Virtual Switch

Go to Hosts and Clusters
Click the DataCenter icon
Select the Networks top tab and the Distributed Switches sub-tab
Right click “rdma-dvs”
Click “Add and Manage Hosts”
Select “Add Hosts”
Select the hosts. Use “auto” for uplinks.
Select the physical adapters based on the list you created in the previous step, or find the Mellanox card in the list and add it. If more than one is listed, look for the card that’s “connected”.
Manage VMkernel adapters (accept defaults)
Migrate virtual machine networking (none)

Tag a vmknic for PVRDMA

PVRDMA requires an out of band (OOB) communication channel (outside of the RDMA protocol) to exchange information that enables virtualization to work. The ESXi Net.PVRDMAVmknic setting determines which vmknic the OOB communication happens on. It has no effect on the data path or on other vmk services — vMotion, vSAN, Provisioning, Management, etc. — those are turned on or off on a per-vmk basis.

Select an ESXi host and go to the Configure tab
Go to System > Advanced System Settings
Click Edit
Filter on “PVRDMA”
Set Net.PVRDMAVmknic = "vmk0"

Repeat for each ESXi host.

Set up the firewall for PVRDMA

Select an ESXi host and go to the Configure tab
Go to System > Firewall
Click Edit
Scroll down to find pvrdma and check the box to allow PVRDMA traffic through the firewall.

Repeat for each ESXi host.

Set up Jumbo Frames for PVRDMA

To enable jumbo frames on a vCenter cluster using virtual switches you have to set MTU 9000 on the Distributed Virtual Switch.

Click the Data Center icon.
Click the Distributed Virtual Switch that you want to set up, “rdma-dvs” in this example.
Go to the Configure tab.
Select Settings > Properties.
Look at Properties > Advanced > MTU. This should be set to 9000. If it’s not, click Edit.
Click Advanced.
Set MTU to 9000.
Click OK.

Add a PVRDMA NIC to a VM

Edit the VM settings
Add a new device
Select “Network Adapter”
Pick “VM 100GbE Network” for the network.
Connect at Power On (checked)
Adapter type PVRDMA (very important!)
Device Protocol: RoCE v2

Configure the VM

For Ubuntu:

sudo apt-get install rdma-core infiniband-diags ibverbs-utils

Tweak the module load order

In order for RDMA to work the vmw_pvrdma module has to be loaded after several other modules. Maybe someone else knows a better way to do this, but the method that I got to work was adding a script /usr/local/sbin/rdma-modules.sh to ensure that Infiniband modules are loaded on boot, then calling that from /etc/rc.local so it gets executed at boot time.

#!/bin/bash
# rdma-modules.sh
# modules that need to be loaded for PVRDMA to work
/sbin/modprobe mlx4_ib
/sbin/modprobe ib_umad
/sbin/modprobe rdma_cm
/sbin/modprobe rdma_ucm

# Once those are loaded, reload the vmw_pvrdma module
/sbin/modprobe -r vmw_pvrdma
/sbin/modprobe vmw_pvrdma

Once that’s done just set up the PVRDMA network interface the same as any other network interface.
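
If the PVRDMA interface doesn't come up after a reboot, a quick way to see whether the script did its job is to check which of those modules are actually loaded. Here's a minimal Python sketch that reads the module names out of /proc/modules. The function names are made up for this example, and note the assumption that these are all loadable modules: anything compiled directly into the kernel won't appear in /proc/modules and would be reported as missing.

```python
# The modules named in rdma-modules.sh, plus vmw_pvrdma itself.
REQUIRED_MODULES = ["mlx4_ib", "ib_umad", "rdma_cm", "rdma_ucm", "vmw_pvrdma"]


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that are not listed in /proc/modules text."""
    # The first whitespace-separated field on each line is the module name.
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


def check_running_kernel():
    """Print which required modules are missing on the machine this runs on."""
    with open("/proc/modules") as f:
        missing = missing_modules(f.read())
    print("missing modules:", ", ".join(missing) if missing else "none")
```

Calling check_running_kernel() on the VM prints the list of anything that didn't get loaded.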

Testing the network

To verify that I’m getting something close to 100Gbps on the network I use the perftest package.

To test bandwidth I pick two VMs on different hosts. On one VM I run:

$ ib_send_bw --report_gbits

On the other VM I run the same command plus I add the IP address of the PVRDMA interface on the first machine:

$ ib_send_bw --report_gbits 192.168.128.39

That sends a bunch of data across the network and reports back:

So I’m getting an average of 96.31Gbps over the network connection.

I can also check the latency using ib_send_lat:

Hope you find this useful.

Getting NVIDIA NGC containers to work with VMware PVRDMA networks

Posted on 2021-04-01 by Earl C. Ruby III

NVIDIA publishes a set of NVIDIA GPU-accelerated Containers (NGC) with applications and frameworks for machine learning, deep learning, and high-performance computing.

VMware developed a platform that allows people and companies to create their own private clouds. For customers with high-speed, low-latency networking requirements they offer a couple of different networking options, one of which is PVRDMA (ParaVirtualized Remote Direct Memory Access) networking.

OpenFabrics Enterprise Distribution (OFED) is open-source software for RDMA applications which includes a set of drivers for high-speed network cards to enable RDMA/Infiniband networking. Some NVIDIA NGC containers ship with Mellanox OFED (MOFED) installed. NVIDIA bought Mellanox in 2020, and MOFED is NVIDIA’s distribution of OFED with all of the non-Mellanox drivers removed. OFED includes support for PVRDMA, but MOFED does not.

NVIDIA containers are based on Ubuntu base images. Ubuntu ships its own RDMA drivers in a package called rdma-core. The Ubuntu rdma-core package contains the open source drivers and utilities needed to work with VMware PVRDMA networking.

Ideally you should only install the RDMA network package that you need, either MOFED or OFED or rdma-core, but not more than one of them. In fact, if you try installing more than one you will have problems. Therefore, if you’re going to use NGC containers on a PVRDMA network you should first remove the MOFED packages and then add the rdma-core packages.

Luckily you can start an NGC container and see if MOFED is installed or not and see what version is installed. If I start the NGC container for TensorRT:

docker run -it --rm -u root nvcr.io/nvidia/tensorrt:19.09-py3

I can see that it’s based on Ubuntu 18.04 “bionic”:

root@2e70d41e1187:/workspace# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

If I look inside /opt/mellanox/DEBS/ I can see if any MOFED .deb files are installed:

root@2e70d41e1187:/workspace# ls -al /opt/mellanox/DEBS/
total 64
drwxrwxr-x 15 root root 4096 Aug 27  2019 .
drwxr-xr-x  3 root root 4096 Sep 13  2019 ..
drwxrwxr-x  2 root root 4096 Aug 27  2019 3.4-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 3.4-2.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.0-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.0-2.0.0
lrwxrwxrwx  1 root root    9 Aug 27  2019 4.0-2.0.2 -> 4.0-2.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.1-1.0.2
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.2-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.2-1.2.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.3-1.0.1
lrwxrwxrwx  1 root root    9 Aug 27  2019 4.3-3.0.2 -> 4.3-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.4-1.0.0
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.4-2.0.7
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.5-1.0.1
drwxrwxr-x  2 root root 4096 Aug 27  2019 4.6-1.0.1
lrwxrwxrwx  1 root root    9 Aug 27  2019 5.0-0 -> 5.0-1.1.8
drwxrwxr-x  2 root root 4096 Aug 27  2019 5.0-1.1.8
-rwxrwxr-x  1 root root  546 Aug 27  2019 add_mofed_version.sh

In this case there are Mellanox MOFED packages installed. If I look inside these directories (ls -1 /opt/mellanox/DEBS/*) I can see that the packages installed from MOFED are:

ibverbs-utils
libibverbs-dev
libibverbs1
libmlx5-1

These are MOFED versions of packages installed in this specific container. A different NGC container might contain these MOFED packages, or different MOFED packages, or no MOFED packages at all.

There are versions of these same packages in Ubuntu repos, and the Ubuntu versions conflict with the MOFED versions. To use the Ubuntu versions, first remove the MOFED packages:

root@2e70d41e1187:/workspace# apt-get purge -y ibverbs-utils libibverbs-dev libibverbs1 libmlx5-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
  ibverbs-utils* libibverbs-dev* libibverbs1* libmlx5-1*
0 upgraded, 0 newly installed, 4 to remove and 23 not upgraded.
After this operation, 1523 kB disk space will be freed.
(Reading database ... 18622 files and directories currently installed.)
Removing ibverbs-utils (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libibverbs-dev (41mlnx1-OFED.4.4.1.0.0.44100) ...
Removing libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...
Removing libibverbs1 (41mlnx1-OFED.4.4.1.0.0.44100) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
(Reading database ... 18449 files and directories currently installed.)
Purging configuration files for libmlx5-1 (41mlnx1-OFED.4.4.0.1.7.44100) ...

You can see in the output above that the packages that I removed have the name “OFED” in their version strings, indicating that they came from MOFED/OFED, not Ubuntu. Next I reinstall using rdma-core and the other packages I need:

apt-get update && apt-get install -y --reinstall \
    -t bionic rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3

This installs everything from the Ubuntu repositories for the “bionic” version, which is the version of Ubuntu that this NGC container is based on. (We determined that earlier from /etc/os-release.)

The -t flag is necessary because I’ve found that some NGC containers mix code from the repositories of different versions of Ubuntu, and we only want to install packages from the base Ubuntu version, which is “bionic” in this particular case.
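
If you're doing this for several different NGC containers, the value to pass to -t can be read from /etc/os-release rather than looked up by eye. A small Python sketch, assuming the file has a UBUNTU_CODENAME line like the one shown earlier; the function name is made up for this example.

```python
def ubuntu_codename(os_release_text):
    """Return the UBUNTU_CODENAME value from /etc/os-release text, or None."""
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "UBUNTU_CODENAME":
            # Values may or may not be wrapped in double quotes.
            return value.strip().strip('"')
    return None
```

Inside the container that would be something like ubuntu_codename(open("/etc/os-release").read()), which gives “bionic” for the TensorRT image used here.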

At this point the container is ready to use PVRDMA connections.
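
To double-check that no MOFED builds are left behind, you can list the installed packages with dpkg-query -W -f='${Package} ${Version}\n' and look for MOFED-style version strings. The Python sketch below does that filtering. It relies on a heuristic taken from the apt-get output above (MOFED versions there contain “OFED” and “mlnx”), which is an assumption and may not hold for every MOFED release; the function name is made up for this example.

```python
def mofed_packages(dpkg_query_output):
    """Return (package, version) pairs whose version string looks like a MOFED build.

    Expects one "package version" pair per line, as printed by:
        dpkg-query -W -f='${Package} ${Version}\n'
    """
    found = []
    for line in dpkg_query_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # package with no installed version, or a blank line
        name, version = parts
        # Heuristic (assumption): MOFED builds carry "OFED" or "mlnx" in the version.
        if "OFED" in version or "mlnx" in version:
            found.append((name, version.strip()))
    return found
```

An empty list means nothing that looks like a MOFED package is still installed.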

However, I also want to connect to a remote Bitfusion server across a PVRDMA network and use a pool of GPUs for my TensorRT work, so I also install the Bitfusion client:

wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

To create a new container with all of these changes I just have to whip up a small Dockerfile:

# Base this container on the NGC container you want to use
FROM nvcr.io/nvidia/tensorrt:19.09-py3

# Remove the MOFED packages that are installed,
# determined by running "ls -1 /opt/mellanox/DEBS/*"
RUN apt-get purge -y ibverbs-utils libibverbs-dev \
    libibverbs1 libmlx5-1

# Install the Ubuntu RDMA packages using the
# UBUNTU_CODENAME from /etc/os-release
# as the -t argument.
RUN apt-get update && apt-get install -y --reinstall \
    -t bionic \
    rdma-core libibverbs1 ibverbs-providers \
    infiniband-diags ibverbs-utils libcapstone3

# Install the Bitfusion 3.0.0 client software for Ubuntu 18.04
RUN wget https://packages.vmware.com/bitfusion/ubuntu/18.04/bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

RUN apt-get install -y ./bitfusion-client-ubuntu1804_3.0.0-11_amd64.deb

To build an image using this Dockerfile:

mkdir -p ~/build
docker build -t tensorrt:19.09-py3-pvrdma -f Dockerfile ~/build

Run this image:

docker run -it --rm -u root --network host \
    tensorrt:19.09-py3-pvrdma

In this instance I’m passing the host’s network through to the container. Assuming that the host already has PVRDMA networking set up correctly, I can use that PVRDMA network inside the NGC container. With the Bitfusion client in the container I can run TensorRT and access GPUs from a remote pool of GPUs across a PVRDMA network.

Hope you find this useful.

You may also be interested in my article Setting up a 100GbE PVRDMA Network on vCenter 7.

Upgrading vCenter 7 via the command line

Posted on 2021-01-25 by Earl C. Ruby III

Updated on 2021-10-26.

I have vCenter 7.0.0.10700 installed and I want to update to 7.0.1.00200. When I run Update Planner > Interoperability it reports that all of my ESXi hosts are running ESXi 7.0.1. If I run the pre-update checks I get “No issues found”. When I go to the appliance to do the upgrade, both “Stage Only” and “Stage and Install” are greyed-out and unselectable.

vCenter 7 Appliance Available Updates screen

I tried a dozen different tricks, including ssh-ing into the appliance as root and editing the /etc/applmgmt/appliance/software_update_state.conf file, but nothing could enable the “Stage Only” and “Stage and Install” buttons.

Use the command line

I finally decided to try upgrading via the command line. I have backups going back 30 days. I even double-checked and yes, my NFS server has files in the backup directory for each of the past 30 days and they have data in them. There’s probably even a way to restore one of those backups if something goes horribly wrong. Onwards!

I was already logged into the vCenter appliance shell as root. The next thing I needed to do was to figure out where the command line tools were hidden. I found them in /usr/lib/applmgmt/support/scripts.

Disclaimer: I work at VMware, but I have no idea if the following is an “acceptable practice” or not. If your production vCenter is broken and you have a support contract, call support. If you’re messing around on a home or test system and you don’t care how badly you screw it up, feel free to try the command line tools.

root@vcenter [ ~ ]# cd /usr/lib/applmgmt/support/scripts
root@vcenter [ /usr/lib/applmgmt/support/scripts ]# ls -al
total 108
drwxr-xr-x 4 root root  4096 Aug 30 18:18 .
drwxr-xr-x 4 root root  4096 Aug 30 18:18 ..
-r-xr-xr-x 1 root root   205 Aug 15 07:16 autogrow.sh
-r-xr-xr-x 1 root root   633 Aug 15 07:16 manifest-verification
-r-xr-xr-x 1 root root   286 Aug 15 07:16 mapping.sh
-r-xr-xr-x 1 root root  2056 Aug 15 07:16 pgtop.py
-r-xr-xr-x 1 root root  3396 Aug 15 07:16 port-accessible.py
drwxr-xr-x 2 root root  4096 Aug 30 18:18 postinstallscripts
-r-xr-xr-x 1 root root  5207 Aug 15 07:16 prestart-applmgmt.sh
-r-xr-xr-x 1 root root  4171 Aug 15 07:16 resize-root.py
-r-xr-xr-x 1 root root   251 Aug 15 07:16 setup-env.sh
-r-xr-xr-x 1 root root  4001 Aug 15 07:16 showlog.py
-r-xr-xr-x 1 root root  3910 Aug 15 07:16 shutdown.py
-r-xr-xr-x 1 root root 35773 Aug 15 07:16 software-packages.py
-r-xr-xr-x 1 root root  8085 Aug 15 07:16 support-bundle.py
drwxr-xr-x 2 root root  4096 Aug 30 18:18 tests

These are the Python scripts that are linked to the Command shell. I’m actually in the root shell. I can run these directly from the root shell, or exit back to the Command shell and use them in the “official” way. In case I need to pull in support let’s do this the official way.

The software-packages.py script is what does the upgrade. Let’s exit back to the Command shell and see what it says it supports.

root@vcenter [ /usr/lib/applmgmt/support/scripts ]# exit
Command> software-packages
usage: software-packages [-h] {stage,unstage,validate,install,list} ...

optional arguments:
  -h, --help            show this help message and exit

sub-commands:
  {stage,unstage,validate,install,list}
    stage               Stage software update packages
    unstage             Purge staged software update packages
    validate            Validate software update packages
    install             Install software update packages
    list                List details of software update packages

Stage the packages for the update

Since the appliance wasn’t letting me upgrade, I thought I’d first check to see if I already have upgrades staged.

Command> software-packages list --staged
 [2021-01-22T21:45:41.022] : Packages not staged

OK. Nothing staged. How do I stage packages?

Command> software-packages stage --help
usage: software-packages stage [-h] [--url [URL]] [--iso] [--acceptEulas] [--thirdParty]

optional arguments:
  -h, --help     show this help message and exit
  --url [URL]    Download software update package from URL. If no url is specified, https://vapp-updates.vmware.com/vai-
                 catalog/valm/vmw/8dc0de9a-feedl-1337-be0a-6ddeadbeefa3/6.7.0.42000.latest/ is used.
  --iso          Load software update packages from CD/DVD drive attached to the appliance
  --acceptEulas  accept all Eulas
  --thirdParty   Stage third party packages.--thirdParty should only be usedwith --url.

Sounds clear enough. I’ll try that:

Command> software-packages stage --url --acceptEulas
 [2021-01-22T21:46:28.022] : Latest updates already installed on VCSA, Nothing to stage

Well that’s not correct. There’s definitely an update available. Re-reading the help I notice that the default URL looks something like:

https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8dc0de9a-feedl-1337-be0a-6ddeadbeefa3/6.7.0.42000.latest/

I’ve obfuscated the actual URL, but that’s a vCenter 6.7.0 URL, I’m using 7.0.0, and I want 7.0.1.

I go back to the appliance web UI and click the Update > Settings button.

Settings shows a different URL for 7.0.1, so I copy and paste that into the command line:

Command> software-packages stage --acceptEulas --url https://vapp-updates.vmware.com/vai-catalog/valm/vmw/......
 [2021-01-22T21:48:28.022] : Target VCSA version = 7.0.1.00200
 [2021-01-22 21:48:28,781] : Running requirements script.....

Update as of 2021-09-21: I just found out about the update.get and update.set commands, used to find and set the default URL used for downloading updates on the command line.

If you type:

update.get

… you’ll get the Currenturl (set when you first installed vCenter) and the Defaulturl (what you should be using to update vCenter). If you then type:

update.set --currentURL default

The Currenturl gets set to the Defaulturl. After that you can type:

software-packages stage --url --acceptEulas

… and the software gets staged from the Currenturl, which is the same URL used by the vCenter GUI.

Installing a specific version of vCenter

Update as of 2021-10-26: The steps shown above are fine if you want to stage the latest update, but what if you want a specific version of vCenter, not the latest?

Right now I’ve got a vCenter 7.0.2.00500 and there are two updates available, 7.0.3.00000 and 7.0.3.00100. If I run update.get:

Command> update.get
Config:
Currenturl: https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8dc0de9a-feedl-1337-be0a-6ddeadbeefa3/7.0.2.00500.latest/
Defaulturl: https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8dc0de9a-feedl-1337-be0a-6ddeadbeefa3/7.0.2.00500.latest/
Checkupdates: disabled
Time: 00:00:00
Day: Everyday
Latestupdateinstalltime: 2021-09-23T00:03:48.493Z
Latestupdatequerytime: ''
Username: ''
Password: ''

(License number obfuscated in the above URLs, use your own.)

Note the “.latest” at the end of the URLs. If I use that URL for staging, but change the version to the specific version that I want (without the .latest extension):

software-packages stage --url https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8d167796-34d5-4899-be0a-6daade4005a3/7.0.3.00000/

I’ve just staged 7.0.3.00000 for install, and that’s the version that will be installed, even though there’s a later 7.0.3.00100 version available.
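
The URL edit is just a swap of the last path component, so it’s easy to script if you do this often. A Python sketch, assuming the Currenturl always ends in “<version>.latest/” as in the update.get output above; the function name is made up for this example.

```python
def pinned_update_url(latest_url, version):
    """Turn a '<version>.latest/' update URL into a URL for one specific version."""
    base = latest_url.rstrip("/")
    prefix, _, last = base.rpartition("/")
    if not prefix or not last.endswith(".latest"):
        raise ValueError("expected a URL ending in <version>.latest/")
    # Keep everything up to the last path component, then append the wanted version.
    return f"{prefix}/{version}/"
```

The result is what gets passed to software-packages stage --url.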

Trust but verify

A little while later everything was staged. I decided to validate everything.

Command> software-packages validate
 [2021-01-22T21:50:11.022] : For the first instance of the identity domain, this is the password given to the Administrator account.  Otherwise, this is the password of the Administrator account of the replication partner.
Enter Single Sign-On administrator password:

 [2021-01-22T21:50:22.022] : Validating software update payload
 [2021-01-22 21:50:22,327] : Running validate script.....
 [2021-01-22T21:50:26.022] : Validation successful
 [2021-01-22T21:50:26.022] : Validation process completed successfully

Then I check to see what’s staged:

Command> software-packages list --staged
 [2021-01-22T21:50:45.022] :
        category: Bugfix
        kb: https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vsphere-vcenter-server-70u1c-release-notes.html
        leaf_services: ['vmware-pod', 'vsphere-ui', 'wcp']
        vendor: VMware, Inc.
        name: VC-7.0U1c
        size in MB: 5107
        tags: []
        version_supported: []
        productname: VMware vCenter Server
        releasedate: December 17, 2020
        executeurl: https://my.vmware.com/group/vmware/get-download?downloadGroup=VC70U1C
        version: 7.0.1.00200
        updateversion: True
        allowedSourceVersions: [7.0.0.0,]
        buildnumber: 17327517
        rebootrequired: False
        summary: {'id': 'patch.summary', 'translatable': 'In-place upgrade for vCenter appliances.', 'localized': 'In-place upgrade for vCenter appliances.'}
        type: Update
        severity: Critical
        TPP_ISO: False
        url: https://vapp-updates.vmware.com/vai-catalog/valm/vmw/8dc0de9a-feedl-1337-be0a-6ddeadbeefa3/7.0.0.10700.latest/
        thirdPartyAvailable: False
        nonThirdPartyAvailable: True
        thirdPartyInstallation: False
        timeToInstall: 0
        requiredDiskSpace: {'/storage/core': 30.353511543273928, '/storage/seat': 32.21015625}
        eulaAcceptTime: 2021-01-22 21:48:37 UTC

Well, that shows:

version: 7.0.1.00200

Which is the version I’ve been trying to upgrade to, so that looks good.

Did I mention that I have backup copies of vCenter going back 30 days? Well I do. If this goes really sideways I’m going to have to restore one of them.

Let’s do the update!

Command> software-packages install --staged
 [2021-01-22T21:51:23.022] : For the first instance of the identity domain, this is the password given to the Administrator account.  Otherwise, this is the password of the Administrator account of the replication partner.
Enter Single Sign-On administrator password:

 [2021-01-22T21:51:43.022] : Validating software update payload
 [2021-01-22 21:51:43,716] : Running validate script.....
 [2021-01-22T21:51:47.022] : Validation successful
 [2021-01-22 21:51:47,730] : Copying software packages 251/251
 [2021-01-22 21:55:37,642] : Running system-prepare script.....
 [2021-01-22 21:55:42,661] : Running test transaction ....
 [2021-01-22 21:55:44,678] : Running prepatch script...
....
 [2021-01-22 21:58:27,896] : Upgrading software packages ....
 [2021-01-22T22:02:10.022] : Setting appliance version to 7.0.1.00200 build 17327517
 [2021-01-22 22:02:10,242] : Running patch script.....
 [2021-01-22 22:11:34,245] : Starting all services ....
 [2021-01-22T22:11:35.022] : Services started.
 [2021-01-22T22:11:35.022] : Installation process completed successfully

That was it. The actual update took about 20 minutes, and although the staged package reported that no reboot was required, vCenter did reboot during the update. When it was done vCenter was running version 7.0.1.00200.

The vCenter appliance Update “Stage Only” and “Stage and Install” buttons are still greyed-out and unselectable, but right now there are no updates available so that’s how they should be. I’ll have to wait for the next update to see if they’re working again. If the buttons are still broken, at least now I know how to use the command line to install an update.

Hope you find this useful.

“Package discrepency error, Cannot resume!”

Update as of 2021-06-30: Since I wrote this article I have successfully upgraded a couple of times using the GUI, and the “Stage Only” and “Stage and Install” buttons are no longer greyed out when an update is available.

I did run into an issue upgrading from 7.0.2.00000 to 7.0.2.00100, and again from 7.0.2.00100 to 7.0.2.00200, where I got the error “Package discrepency error, Cannot resume!” [sic] when I tried to stage the update. Both times I resolved the error and got the upgrades to install by following the steps in William Lam’s article “Stage Only & Stage and Install buttons disabled when updating to vSphere 7.0 Update 2a”. According to William these steps will need to be repeated until 7.0.3 is released:

Command> shell
rm -rf /storage/core/software-update/updates
rm -rf /storage/updatemgr/software-*
rm /etc/applmgmt/appliance/software_update_state.conf
rm /storage/db/patching.db*
rm -r /storage/core/software-update/*
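
Since those commands delete things outright, it can be reassuring to see what they would match first. The following is a minimal sketch, not part of William's steps: list_cleanup_targets is a name I made up, and its optional argument is a hypothetical path prefix that only exists so you can try it against a scratch directory. It prints the existing paths that the rm commands above would hit and deletes nothing.

```shell
# Sketch: list what the cleanup commands would remove, without deleting anything.
# The optional first argument is a hypothetical root prefix for testing;
# leave it off when running on the appliance itself.
list_cleanup_targets() {
    root="${1:-}"
    for p in \
        "$root"/storage/core/software-update/updates \
        "$root"/storage/updatemgr/software-* \
        "$root"/etc/applmgmt/appliance/software_update_state.conf \
        "$root"/storage/db/patching.db* \
        "$root"/storage/core/software-update/*
    do
        # Globs that match nothing stay literal, so only print paths that exist.
        if [ -e "$p" ]; then
            echo "$p"
        fi
    done | sort -u   # the first and last patterns overlap, so drop duplicates
}
```

Running list_cleanup_targets with no arguments from the shell prompt shows the real paths. If it prints nothing, there is nothing left to clean up.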

Update as of 2021-10-26: I tried the UI today to upgrade from vCenter 7.0.2.00500 to 7.0.3.00000, and the UI still failed, so I used the command line to upgrade to 7.0.3.00000.

Once 7.0.3.00000 was installed I was able to upgrade to 7.0.3.00100 using the UI, so it looks like the UI problem has been resolved in 7.0.3 as William said it would be.

“Test transaction failed to update packages”

Update as of 2021-09-21: I was upgrading a couple of vCenter instances today to the latest 7.0.2.00500 release and on one vCenter I got the error:

 [2021-09-21T17:35:56.264] : Validating software update payload
 [2021-09-21T17:35:56.264] : UpdateInfo: Using product version 7.0.2.00200 and build 17958471
 [2021-09-21 17:35:56,064] : Running validate script.....
 [2021-09-21T17:36:00.264] : Validation successful
 [2021-09-21 17:36:00,084] : Copying software packages 152/152
 [2021-09-21 17:55:01,033] : Running system-prepare script.....
 [2021-09-21 17:55:06,053] : Running test transaction ....
 [2021-09-21T17:55:07.264] : Installation process failed
 [2021-09-21T17:55:07.264] : Test transaction failed to update packages

“Test transaction failed to update packages” means something failed with the package install, so I read through /var/log/vmware/applmgmt/software-packages.log and looked for lines with ERR in them. It turned out that I had run out of space in /storage/log. Once I freed up some space I re-ran the update and it installed fine.
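
For reference, here is roughly what that check looks like from the appliance's shell. This is a sketch rather than a record of the exact commands I typed: the two function names are made up, and the default paths are simply the ones mentioned above. It uses only grep, tail, and df.

```shell
# Sketch: show recent ERR lines from the update log, then check free space.
# Both functions accept an alternate path as their first argument.
show_update_errors() {
    log="${1:-/var/log/vmware/applmgmt/software-packages.log}"
    # Print the last 20 lines containing ERR, prefixed with their line numbers.
    grep -n 'ERR' "$log" | tail -n 20
}

show_log_space() {
    dir="${1:-/storage/log}"
    # Report how full the partition holding the log directory is.
    df -h "$dir"
}
```

Run show_update_errors followed by show_log_space. If the partition shows 100% use, that is a likely culprit.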