post

Setting up a 100GbE PVRDMA Network on vCenter 7

After writing my last article, Getting NVIDIA NGC containers to work with VMware PVRDMA networks, I had a couple of people ask me “How do I set up PVRDMA networking on vCenter?” These are the steps that I took to set up PVRDMA networking in my lab.

RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. It works by encapsulating an InfiniBand (IB) transport packet and sending it over Ethernet. If you’re working with network applications that require high bandwidth and low latency, RDMA will give you lower latency, higher bandwidth, and a lower CPU load than an API such as Berkeley sockets.

Full disclosure: I used to work for a startup called Bitfusion, which was bought by VMware, so I now work for VMware. At Bitfusion we developed a technology for accessing hardware accelerators, such as NVIDIA GPUs, remotely across networks using TCP/IP, InfiniBand, and PVRDMA. I still work on the Bitfusion product at VMware, and spend a lot of my time getting AI and ML workloads to work across networks on virtualized GPUs.

In my lab I’m using Mellanox ConnectX-5 and ConnectX-6 cards on hosts that are running ESXi 7.0.2 and vCenter 7.0.2. The cards are connected to a Mellanox Onyx MSN2700 100GbE switch.

Since I’m working with Ubuntu 18.04 and 20.04 virtual machines (VMs) in a vCenter environment, I have a couple of options for high-speed networking:

  • I can use PCI passthrough to pass the PCI network card directly through to the VM and use the network card’s native drivers on the VM to set up a networking stack. However, this means that my network card is only available to a single VM on the host, and can’t be shared between VMs. It also breaks vMotion (the ability to live-migrate the VM to another host) since the VM is tied to a specific piece of hardware on a specific host. I’ve set this up in my lab but stopped doing this because of the lack of flexibility and because we couldn’t identify any performance difference compared to SR-IOV networking.
  • I can use SR-IOV and Network Virtual Functions (NVFs) to make the single card appear as if it’s multiple network cards with multiple PCI addresses, pass those through to the VM, and use the network card’s native drivers on the VM to set up a networking stack. I’ve set this up in my lab as well. I can share a single card between multiple VMs and the performance is similar to PCI passthrough. The disadvantage is that setting up SR-IOV and configuring the NVFs is specific to a card’s model and manufacturer, so what works in my lab might not work in someone else’s environment.
  • I can set up PVRDMA networking and use the PVRDMA driver that comes with Ubuntu. This is what I’m going to show how to do in this article.

Set up your physical switch

First, make sure that your switch is set up correctly. On my Mellanox Onyx MSN2700 100GbE switch that means:

  • Enable the ports you’re connecting to.
  • Set the speed of each port to 100G.
  • Set auto-negotiation for each link.
  • MTU: 9000
  • Flowcontrol Mode: Global
  • LAG/MLAG: No
  • LAG Mode: On

Set up your virtual switch

vCenter supports Paravirtual RDMA (PVRDMA) networking using Distributed Virtual Switches (DVS). This means you’re setting up a virtual switch in vCenter and you’ll connect your VMs to this virtual switch.

In vCenter navigate to Hosts and Clusters, then click the DataCenter icon (looks like a sphere or globe with a line under it). Find the cluster you want to add the virtual switch to, right click on the cluster and select Distributed Switch > New Distributed Switch.

  • Name: “rdma-dvs”
  • Version: 7.0.2 – ESXi 7.0.2 and later
  • Number of uplinks: 4
  • Network I/O control: Disabled
  • Default port group: Create
  • Port Group Name: “VM 100GbE Network”

Figure out which NIC is the right NIC

  • Go to Hosts and Clusters
  • Select the host
  • Click the Configure tab, then Networking > Physical adapters
  • Note which NIC is the 100GbE NIC for each host

Add Hosts to the Distributed Virtual Switch

  • Go to Hosts and Clusters
  • Click the DataCenter icon
  • Select the Networks top tab and the Distributed Switches sub-tab
  • Right click “rdma-dvs”
  • Click “Add and Manage Hosts”
  • Select “Add Hosts”
  • Select the hosts. Use “auto” for uplinks.
  • Select the physical adapters based on the list you created in the previous step, or find the Mellanox card in the list and add it. If more than one is listed, look for the card that’s “connected”.
  • Manage VMkernel adapters (accept defaults)
  • Migrate virtual machine networking (none)

Tag a vmknic for PVRDMA

  • Select an ESXi host and go to the Configure tab
  • Go to System > Advanced System Settings
  • Click Edit
  • Filter on “PVRDMA”
  • Set Net.PVRDMAVmknic = "vmk0"

Repeat for each ESXi host.
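
If you would rather script this than click through every host, the same advanced setting can, as far as I know, be set with esxcli. This is a minimal sketch, not part of the steps above: it assumes SSH is enabled on each ESXi host and that you log in as root, and the helper names pvrdma_vmknic_cmd and tag_pvrdma_hosts are made up for illustration.

```shell
#!/bin/bash
# Sketch: tag a vmknic for PVRDMA from the command line.
# Assumptions: SSH is enabled on the ESXi hosts and root login is allowed.
# The function names here are hypothetical helpers, not VMware tools.

# Print the esxcli command that sets Net.PVRDMAVmknic to the given vmknic.
pvrdma_vmknic_cmd() {
    local vmk="${1:-vmk0}"
    printf 'esxcli system settings advanced set -o /Net/PVRDMAVmknic -s %s\n' "$vmk"
}

# Run that command on every host listed after the vmknic name.
tag_pvrdma_hosts() {
    local vmk="$1"
    shift
    local host
    for host in "$@"; do
        ssh "root@${host}" "$(pvrdma_vmknic_cmd "$vmk")"
    done
}
```

Usage would look like: tag_pvrdma_hosts vmk0 esxi1.example.com esxi2.example.com (placeholder host names).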

Set up the firewall for PVRDMA

  • Select an ESXi host and go to the Configure tab
  • Go to System > Firewall
  • Click Edit
  • Scroll down to find pvrdma and check the box to allow PVRDMA traffic through the firewall.

Repeat for each ESXi host.
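
The firewall rule can likely be enabled from the ESXi shell as well. The sketch below makes the same assumptions as any SSH-based approach (SSH enabled on the hosts, root login); the variable and function names are hypothetical, and only the esxcli command itself is meant to mirror the checkbox in the UI.

```shell
#!/bin/bash
# Sketch: allow PVRDMA traffic through the ESXi firewall without the UI.
# Assumptions: SSH is enabled on each host; names below are illustrative only.

# esxcli command that enables the "pvrdma" firewall ruleset.
PVRDMA_FIREWALL_CMD='esxcli network firewall ruleset set -e true -r pvrdma'

# Enable the ruleset on every host passed as an argument.
enable_pvrdma_firewall() {
    local host
    for host in "$@"; do
        ssh "root@${host}" "$PVRDMA_FIREWALL_CMD"
    done
}
```

Call it with your own host names, for example enable_pvrdma_firewall esxi1.example.com esxi2.example.com.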

Set up Jumbo Frames for PVRDMA

To enable jumbo frames on a vCenter cluster using virtual switches you have to set MTU 9000 on the Distributed Virtual Switch.

  • Click the Data Center icon.
  • Click the Distributed Virtual Switch that you want to set up, “rdma-dvs” in this example.
  • Go to the Configure tab.
  • Select Settings > Properties.
  • Look at Properties > Advanced > MTU. This should be set to 9000. If it’s not, click Edit.
  • Click Advanced.
  • Set MTU to 9000.
  • Click OK.
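
Once VMs are attached to the switch, one way to check that jumbo frames really make it end to end is to ping with the don't-fragment flag and a payload sized to fill the MTU. This is a sketch under assumptions: it is run from a Linux VM whose interface MTU is also set to 9000, it uses the iputils ping options, and the function names are hypothetical.

```shell
#!/bin/bash
# Sketch: verify that a path carries jumbo frames without fragmentation.
# Assumes a Linux guest with iputils ping and an interface MTU of 9000.

# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    local mtu="$1"
    echo $(( mtu - 20 - 8 ))
}

# Ping a peer with "don't fragment" set; this fails if any hop has a smaller MTU.
check_jumbo_path() {
    local peer="$1" mtu="${2:-9000}"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}
```

For an MTU of 9000 the payload works out to 8972 bytes, so check_jumbo_path 192.168.128.39 sends 8972-byte pings to that address.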

Add a PVRDMA NIC to a VM

  • Edit the VM settings
  • Add a new device
  • Select “Network Adapter”
  • Pick “VM 100GbE Network” for the network.
  • Connect at Power On (checked)
  • Adapter type PVRDMA (very important!)
  • Device Protocol: RoCE v2

Configure the VM

For Ubuntu:

sudo apt-get install rdma-core infiniband-diags ibverbs-utils

Tweak the module load order

In order for RDMA to work, the vmw_pvrdma module has to be loaded after several other modules. Maybe someone else knows a better way to do this, but the method that I got to work was adding a script, /usr/local/sbin/rdma-modules.sh, to ensure that the InfiniBand modules are loaded on boot, then calling it from /etc/rc.local so it gets executed at boot time.

#!/bin/bash
# rdma-modules.sh
# modules that need to be loaded for PVRDMA to work
/sbin/modprobe mlx4_ib
/sbin/modprobe ib_umad
/sbin/modprobe rdma_cm
/sbin/modprobe rdma_ucm

# Once those are loaded, reload the vmw_pvrdma module
/sbin/modprobe -r vmw_pvrdma
/sbin/modprobe vmw_pvrdma

Once that’s done just set up the PVRDMA network interface the same as any other network interface.
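
After a reboot it can be worth confirming that the modules actually loaded and that an RDMA device showed up. The following is a rough sketch, assuming the modules are built as loadable modules (so they appear in /proc/modules) and that RDMA devices are listed under /sys/class/infiniband; the function names are hypothetical.

```shell
#!/bin/bash
# Sketch: sanity checks after rdma-modules.sh has run.
# Assumes loadable modules (visible in /proc/modules) and the usual sysfs layout.

# Print each module from rdma-modules.sh that is NOT currently loaded.
# An alternative file can be passed in place of /proc/modules.
missing_rdma_modules() {
    local modules_file="${1:-/proc/modules}"
    local mod
    for mod in mlx4_ib ib_umad rdma_cm rdma_ucm vmw_pvrdma; do
        grep -q "^${mod} " "$modules_file" || echo "$mod"
    done
}

# List the RDMA devices registered with the kernel (no output means none).
list_rdma_devices() {
    local sysfs_dir="${1:-/sys/class/infiniband}"
    if [ -d "$sysfs_dir" ]; then
        ls "$sysfs_dir"
    fi
}
```

If missing_rdma_modules prints nothing and list_rdma_devices shows a device, running ibv_devinfo (from the ibverbs-utils package installed earlier) should give more detail about it.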

Testing the network

To verify that I’m getting something close to 100Gbps on the network I use the perftest package.

To test bandwidth I pick two VMs on different hosts. On one VM I run:

$ ib_send_bw --report_gbits

On the other VM I run the same command plus I add the IP address of the PVRDMA interface on the first machine:

$ ib_send_bw --report_gbits 192.168.128.39

That sends a bunch of data across the network and reports back:

So I’m getting an average of 96.31Gbps over the network connection.

I can also check the latency using ib_send_lat.

Hope you find this useful.

post

Mouse button Copy & Paste on Ubuntu 20.04

Using the left mouse button to select and copy text in terminals and the middle mouse button to paste has been a feature of X-Windows, and the various window managers built on top of X-Windows, since the early 1990s. With the release of Ubuntu 20.04 and Gnome 3.36 Canonical has removed this convention, forcing a slower and more awkward sequence to do the same thing: select, right click, select Copy from a menu, point, right click, select Paste from a menu.

If you want to restore select-to-copy, middle button to paste functionality to Ubuntu 20.04 just follow these steps.

Restore select-to-copy functionality

Edit the file .Xresources in your home directory.

Add the line:

xterm*selectToClipboard: true

… to the file, then log out of your desktop and log back in, or reboot.

Once you’ve done that, any text that you select in the Terminal program with your left mouse button will be copied to your clipboard. Double-click a word and the word is copied to the clipboard. Left click and drag to select and copy an entire line, an entire paragraph, or more.
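
If you set up machines often, the edit can be scripted. This is a small sketch with a hypothetical helper name; it only appends the resource line when it is not already in the file, so it is safe to run more than once.

```shell
#!/bin/bash
# Sketch: add the selectToClipboard resource to an X resources file.
# The function name is a hypothetical helper; the default path is ~/.Xresources.

add_select_to_clipboard() {
    local file="${1:-$HOME/.Xresources}"
    local line='xterm*selectToClipboard: true'
    touch "$file"
    # -x matches the whole line, -F treats the "*" literally.
    grep -qxF "$line" "$file" || echo "$line" >> "$file"
}
```

After running add_select_to_clipboard, xrdb -merge ~/.Xresources should load the setting into the running X session for newly started terminals, which may save you the logout.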

Restore middle-button paste functionality

Install gnome-tweaks:

sudo apt-get install gnome-tweaks

Click “Activities” in the upper left and search for “tweaks”, then click the “Tweaks” icon.

Select “Keyboard & Mouse” and turn “Middle Click Paste” to “on”.

Once you’ve done that, clicking the middle mouse button will paste text from your clipboard back into the terminal.
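
The Tweaks switch appears to be a front end for a GSettings key, so the same change can probably be made from a terminal. Treat this as a sketch: the schema and key below are what I believe Tweaks toggles, and the variable and function names are hypothetical.

```shell
#!/bin/bash
# Sketch: toggle middle-click paste without opening Tweaks.
# Assumption: this GSettings key is the one behind the "Middle Click Paste" switch.
PASTE_SCHEMA='org.gnome.desktop.interface'
PASTE_KEY='gtk-enable-primary-paste'

# Turn middle-click paste on for the current user.
enable_middle_click_paste() {
    gsettings set "$PASTE_SCHEMA" "$PASTE_KEY" true
}

# Print the current value of the setting (true or false).
show_middle_click_paste() {
    gsettings get "$PASTE_SCHEMA" "$PASTE_KEY"
}
```

Run enable_middle_click_paste as your normal desktop user, not with sudo, since the setting is stored per user.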

Hope you find this useful.

post

Automatically decrypt multiple LUKS-encrypted volumes

I’ve written in the past on Adding an external encrypted drive with LVM to Ubuntu Linux and Adding a LUKS-encrypted iSCSI volume to Synology DS414 NAS but I neglected to mention how to automatically decrypt additional volumes.

When installing a fresh copy of Ubuntu one of the options is to install with a LUKS-encrypted Logical Volume Manager Volume Group (LVM VG). This puts your root volume on the encrypted LVM VG. When you power up your machine Ubuntu prompts you to enter the decryption passphrase in order to decrypt the VG and start your computer. Without the passphrase the contents of your hard drive are unreadable.

If you add encrypted external drives and/or additional VGs you will end up with multiple encrypted volumes. Ubuntu will prompt you for the passphrase of each additional encrypted volume when you boot up the machine.

If you don’t want to enter multiple, different passphrases each time you boot, you can store the passphrases for additional volumes on the encrypted root filesystem of your first drive using the /etc/crypttab file. You’ll be prompted for just one passphrase, the one for the first VG, and unlocking that volume makes the stored passphrases for the additional volumes available.

Here’s how it works.

The /etc/crypttab file contains 4 fields per line: the name of the encrypted volume, a UUID identifying the storage device, the name of a file with the decryption passphrase, and encryption options.

nvme0n1p5   UUID=405d8c73-1cf9-4b2c-9b8e-c76b90d27c67 none                        luks,discard
datastorage UUID=f2d73ac8-1ef1-4735-9dd4-9e778fc9e781 /root/.luks-datastorage     luks,discard
external1   UUID=0140476b-dd0b-4aab-b7d4-2f5fa14d1a0c /root/.luks-backupexternal1 luks
external2   UUID=610a67d4-c4f6-4b73-a824-a437971e8d24 /root/.luks-backupexternal2 luks
iscsi       UUID=b106b749-f4ab-44be-8962-6ff867dc074e /root/.luks-backupiscsi     luks

The first volume, nvme0n1p5, is the encrypted boot volume. It contains the root filesystem and the /root home directory. The third field is “none” which means that Ubuntu will prompt you for a decryption passphrase in order to unlock and decrypt the drive.

The remaining volumes have files defined that contain the decryption passphrase for each volume. Those files are hidden files in the /root home directory. Once the nvme0n1p5 volume is decrypted and mounted, the remaining volumes are automatically decrypted using the passphrases stored in the hidden files.

The end result is that all of your drives are encrypted, but you only have to enter one passphrase to unlock all of your drives.
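
Here is a sketch of how a key file and its crypttab entry might be produced. It is an illustration under assumptions: it uses a random key file rather than a typed passphrase, the function names are hypothetical, and /dev/sdX in the note below is a placeholder for your own device.

```shell
#!/bin/bash
# Sketch: create a key file and format a matching /etc/crypttab line.
# Function names are hypothetical; adjust paths and options to your setup.

# Create a 4096-byte random key file that only its owner can read.
make_luks_keyfile() {
    local keyfile="$1"
    ( umask 077 && dd if=/dev/urandom of="$keyfile" bs=512 count=8 status=none )
    chmod 0400 "$keyfile"
}

# Print a crypttab line: name, UUID, key file, options (default "luks").
crypttab_line() {
    local name="$1" uuid="$2" keyfile="$3" options="${4:-luks}"
    printf '%s UUID=%s %s %s\n' "$name" "$uuid" "$keyfile" "$options"
}
```

The key file still has to be registered with the volume, for example with cryptsetup luksAddKey /dev/sdX /root/.luks-datastorage, and blkid -s UUID -o value /dev/sdX prints the UUID to put in the second field.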

Hope you find this useful.

post

Setting up a personal, production-quality Kubernetes cluster with Kubespray

I’ve been setting up and tearing down Kubernetes clusters for testing various things for the past year, mostly using Vagrant/VirtualBox but also some VMware vSphere and OpenStack deployments.

I wanted to set up something a little more permanent in my home lab: a cluster where I could add and remove nodes, run nodes on multiple physical machines, and use different types of compute hardware.

Set up the virtual machines

To get started I used a desktop System76 Wild Dog Pro Linux box (4.5 GHz i7-7700K, 64GB DDR4) and my create-vm script to create six Ubuntu 18.04 “Bionic Beaver” VMs for the cluster:

for n in $(seq 1 6); do
    create-vm -n node$n \
        -i ./ubuntu-18.04-server-amd64.iso \
        -k ./ubuntu.ks \
        -r 4096 \
        -c 2 \
        -s 40
done

With these parameters each VM will have 4GB RAM, 2 VCPUs, and a 40GB hard drive.
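
Before creating the VMs it helps to add up what the whole set will ask of the host. A quick sketch of that arithmetic, using a hypothetical helper and the same numbers as the create-vm call:

```shell
#!/bin/bash
# Sketch: total resources requested by N identical VMs.
# Arguments: VM count, RAM per VM in MB, vCPUs per VM, disk per VM in GB.
cluster_totals() {
    local count="$1" ram_mb="$2" vcpus="$3" disk_gb="$4"
    echo "ram_gb=$(( count * ram_mb / 1024 )) vcpus=$(( count * vcpus )) disk_gb=$(( count * disk_gb ))"
}

# Six VMs with 4096MB RAM, 2 vCPUs, and a 40GB disk each.
cluster_totals 6 4096 2 40
# prints: ram_gb=24 vcpus=12 disk_gb=240
```

So the six VMs together ask for 24GB of the host's 64GB of RAM, 12 vCPUs, and 240GB of disk.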

Install and configure Kubespray

I cloned Kubespray into a directory and created an Ansible inventory file following the instructions from the README.

git clone git@github.com:kubernetes-sigs/kubespray.git
cd kubespray
pip install -r requirements.txt
rm -Rf inventory/mycluster/
cp -rfp inventory/sample inventory/mycluster
declare -a IPS=($(for n in $(seq 1 6); do get-vm-ip node$n; done))
CONFIG_FILE=inventory/mycluster/hosts.ini \
python3 contrib/inventory_builder/inventory.py ${IPS[@]}

The get-vm-ip script is in the same repo as the create-vm script, and both are described in my Use .iso and Kickstart files to automatically create Ubuntu VMs article.

The inventory.py script generates an Ansible hosts inventory file in inventory/mycluster/hosts.ini with all of your VM IP addresses.

I like to add one variable override to the bottom of hosts.ini which copies the kubectl credentials over to my host machine. That way I can run kubectl commands directly from my desktop. The extra lines to add to the bottom of hosts.ini are:

[all:vars]
kubectl_localhost=true

Install Kubernetes

To install Kubernetes on the VMs I run the Kubespray cluster.yml playbook:

export ANSIBLE_REMOTE_USER=ansible
ansible-playbook -i inventory/mycluster/hosts.ini \
--become --become-user=root cluster.yml

Once the playbooks have finished, you should have a fully-operational Kubernetes cluster running on your desktop.

At this point you should be able to query the cluster from your desktop using kubectl. For example:

$ kubectl cluster-info
Kubernetes master is running at https://192.168.122.251:6443
coredns is running at https://192.168.122.251:6443/api/v1/namespaces/kube-system/services/coredns:dns/proxy
kubernetes-dashboard is running at https://192.168.122.251:6443/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
$ kubectl get nodes
NAME    STATUS   ROLES         AGE    VERSION
node1   Ready    master,node   3d6h   v1.13.0
node2   Ready    master,node   3d6h   v1.13.0
node3   Ready    node          3d6h   v1.13.0
node4   Ready    node          3d6h   v1.13.0
node5   Ready    node          3d6h   v1.13.0
node6   Ready    node          3d6h   v1.13.0
$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-67f89845f-6zbvx   1/1     Running   1          3d6h
kube-system   calico-node-jh7ng                         1/1     Running   2          3d6h
kube-system   calico-node-l9vfb                         1/1     Running   2          3d6h
kube-system   calico-node-mqxjx                         1/1     Running   2          3d6h
...

Set up the Kubernetes Dashboard

One of the first things I like to do is set up access to the Kubernetes dashboard. First I set up a service account for the admin user:

$ cat ~/Projects/k8s-cluster/dashboard-adminuser.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kube-system
$ kubectl apply -f ~/Projects/k8s-cluster/dashboard-adminuser.yaml

Next I get the bearer token for the user account:

$ kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep admin-user | awk '{print $1}')
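
The describe output includes more than the token, so a small filter can save some scrolling. A minimal sketch, assuming the token appears on a line that starts with "token:" followed by the value; extract_token is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print only the bearer token from `kubectl describe secret` output.
# Reads the describe output on stdin; assumes a "token: <value>" line.
extract_token() {
    awk '$1 == "token:" { print $2 }'
}
```

Pipe the command above into it, that is, append | extract_token to the describe secret command, and only the token string is printed.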

Finally I plug the dashboard URL that I got from kubectl cluster-info into my browser, select “Token” authentication, and cut and paste in the bearer token to log into the system.

Once logged in, an overview of my cluster pops up:

With a minimal amount of working compute infrastructure, it’s easy to set up your own production-quality Kubernetes cluster using Kubespray.

Hope you find this useful.

post

How to get the IP address of a KVM/virsh VM

Since virsh domifaddr doesn’t work to get the IP addresses of VMs on a bridged network, I wrote a get-vm-ip script (which you can download from GitHub) which uses this to get the IP of a running VM:

HOSTNAME=[your vm name]
MAC=$(virsh domiflist $HOSTNAME | awk '{ print $5 }' | tail -2 | head -1)
arp -a | grep $MAC | awk '{ print $2 }' | sed 's/[()]//g'

The virsh command gets the MAC address; the last line finds the IP address using arp.
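
The same idea can be written as two small filters that do not depend on column positions or on the trailing blank line in the virsh output. This is a sketch rather than the get-vm-ip script itself: the function names are hypothetical, and it assumes the Linux arp -a layout where the IP is the second field and the MAC is the fourth.

```shell
#!/bin/bash
# Sketch: MAC and IP lookup as reusable filters (not the actual get-vm-ip script).

# Print the first MAC address found in `virsh domiflist` output read from stdin.
mac_from_domiflist() {
    grep -Eio '([0-9a-f]{2}:){5}[0-9a-f]{2}' | head -n 1
}

# Read `arp -a` output on stdin and print the IP address listed for a MAC.
# The comparison ignores case; parentheses around the IP are stripped.
ip_for_mac() {
    local mac="$1"
    awk -v mac="$mac" 'tolower($4) == tolower(mac) { gsub(/[()]/, "", $2); print $2 }'
}
```

With a VM called node1, for example: mac=$(virsh domiflist node1 | mac_from_domiflist) followed by arp -a | ip_for_mac "$mac".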

Hope you find this useful.