post

GitLab: The source branch does not exist [SOLVED]

Got an interesting error on GitLab today. On an MR that had passed tests and had been approved, GitLab would not allow the branch to be merged because “The source branch [Branch Name] does not exist. Please restore it or use a different source branch.”

Checking the “Changes” tab in GitLab showed all of the changes that I expected to see, so GitLab could see the changes and knew the branch name.

I checked the commit SHA connected to the branch, and GitLab showed all of the changes I expected to see there too. GitLab knew the correct SHA and was associating that SHA with the branch name, yet the “Overview” tab still showed the error message “The source branch [Branch Name] does not exist. Please restore it or use a different source branch.”

I Googled to see if anyone else had had the issue. Many people had. This issue report on GitLab.com shows that the problem goes back at least 3 years. The official response from GitLab is that “This is no longer an active issue.”

I beg to differ. I’m using GitLab Enterprise Edition 14.8.6-ee and it still has this issue.

I cloned a fresh copy of the repo from GitLab on my laptop and checked the branches with:

git branch -a

The branch was listed. I checked it out. It checked out just fine. I diffed it against the main branch. All of the changes were present.
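
The checks themselves are just these two commands (with [Branch Name] standing in for the MR's source branch):

$ git checkout [Branch Name]
$ git diff main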

So the branch exists in the repository on the server, it has all of the changes in it that I expect, and there is nothing wrong with the Git repo itself; it’s just GitLab that has some sort of disconnect.

To work around the problem I added an empty commit to the branch and pushed it to the GitLab repo:

$ git commit --allow-empty -m "Empty commit"
$ git push

Once I did that the error message cleared. GitLab had to re-run the tests, but once the tests passed I could merge the branch.

Hope you find this useful.

Update: I spoke to one of the people who manages GitLab Enterprise for my team, and apparently GitLab works by scraping the Git repo for updates and changes using a “scraper service”. If the service is offline or reset at the time your commit is pushed, it’s supposed to pick up any missed pushes when it restarts, except sometimes that doesn’t work.

As long as the service is online the next time you push a commit, GitLab will pick up the new commit and any missed commits, and everything will be back in sync.

post

Calculating the value for 64bitMMIOSizeGB

When adding a GPU to a vSphere VM using PCI passthrough there are a couple of additional settings you need to configure or your VM won’t boot.

When creating the VM, go to Actions > Edit > VM Options > Boot Options > Firmware and select “EFI”. You need to do this before you install the operating system on the VM. If you don’t, the GPUs won’t work and the VM won’t boot.
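
If you'd rather make the change outside the UI, the firmware selection is a single line in the VM's .vmx file; the equivalent setting looks like this:

firmware = "efi"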

To add a GPU, in vCenter go to the VM, select Actions > Edit > Add New Device. Any GPUs set up as PCI passthrough devices should appear in a pick list. Add one or more GPUs to your VM.

Note that after adding one device, when you add additional GPUs the first GPU you selected still appears in the pick list. If you add the same GPU more than once your VM will not boot. If you add a GPU that’s being used by another running VM your VM will not boot. Pay attention to the PCI bus addresses displayed and make sure that the GPUs you pick are unique and not in use on another VM.

Finally you have to set up memory-mapped I/O (MMIO) to map system memory to the GPU’s framebuffer memory so that the CPU can pass data to the GPU. In vCenter go to the VM, select Actions > Edit > VM Options > Advanced > Edit configuration.

Once you’re on the Configuration parameters screen, add two more parameters:

pciPassthru.use64bitMMIO = TRUE
pciPassthru.64bitMMIOSizeGB = ????

The 64bitMMIOSizeGB value is calculated by adding up the total GB of framebuffer memory on all GPUs attached to the VM. If that total is exactly a power of 2, setting pciPassthru.64bitMMIOSizeGB to the next power of 2 works.

If the total GPU framebuffer memory falls between two powers-of-2, round up to the next power of 2, then round up again, to get a working setting.

Powers of 2 are 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 …

For example, two NVIDIA A100 cards with 40GB each = 80GB (in between 64GB and 128GB), so round up to the next power of 2 (128GB), then round up again to the next power of 2 after that (256GB) to get the correct setting. If you set it too low the VM won’t boot, but it won’t give you an error message telling you what the issue is either.
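
Both cases collapse into a single rule: take the smallest power of 2 that is greater than or equal to the total framebuffer GB, then double it. Here's a short shell sketch of that calculation (my own illustration, not something from VMware's docs; mmio-size.sh is a hypothetical name):

$ cat mmio-size.sh
#!/bin/bash
# Usage: ./mmio-size.sh <total GPU framebuffer in GB>
# Sketch only: doubles the smallest power of 2 >= the total.
total=$1
p=1
while [ "$p" -lt "$total" ]; do
  p=$((p * 2))
done
echo "pciPassthru.64bitMMIOSizeGB = $((p * 2))"
$ ./mmio-size.sh 80
pciPassthru.64bitMMIOSizeGB = 256

For each of the verified configurations below it produces the same value.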

Here are some configurations that I’ve tested and verified:

  • 2 x 16GB NVIDIA V100 = 32GB, 32 is a power of 2, so round up to the next power of 2 which is 64, set pciPassthru.64bitMMIOSizeGB = 64 to boot.
  • 2 x 24GB NVIDIA P40 = 48GB, which is in-between 32 and 64, round up to 64 and again to 128, requires pciPassthru.64bitMMIOSizeGB = 128 to boot.
  • 8 x 16GB NVIDIA V100 = 128GB, 128 is a power of 2, so round up to the next power of 2 which is 256, set pciPassthru.64bitMMIOSizeGB = 256 to boot.
  • 10 x 16GB NVIDIA V100 = 160GB, which is in-between 128 and 256, round up to 256 and again to 512, set pciPassthru.64bitMMIOSizeGB = 512 to boot.

Hope you find this useful.

post

Setting up NFS FSID for multiple networks

The official documentation for creating an NFS /etc/exports file for multiple networks and FSIDs is unclear and confusing. Here’s what you need to know.

If you need to specify multiple subnets that are allowed to mount a volume, you can either use separate lines in /etc/exports, like so:

/opt/dir1 192.168.1.0/24(rw,sync)
/opt/dir1 10.10.0.0/22(rw,sync)

Or you can list all of the subnets on a single line, repeating the mount options for each subnet, like so:

/opt/dir1 192.168.1.0/24(rw,sync) 10.10.0.0/22(rw,sync)

These are both equivalent. They will allow clients in the 192.168.1.0/24 and 10.10.0.0/22 subnets to mount the /opt/dir1 directory via NFS. A client in a different subnet will not be able to mount the filesystem.
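
A quick way to test this from a client in one of the allowed subnets is a plain NFS mount (nfs-server here is a placeholder for your server's hostname):

$ sudo mount -t nfs nfs-server:/opt/dir1 /mnt

Run from a host outside 192.168.1.0/24 or 10.10.0.0/22, the same mount will be refused.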

When I’m setting up NFS servers I like to assign each exported volume a unique FSID. If you don’t use FSID, there is a chance that the way the server identifies the volume will change between reboots, and your NFS clients will hang with “stale file handle” errors. I say “a chance” because it depends on how your server stores volumes, what version of NFS it’s running, and, if it’s a fault-tolerant / high-availability server, how that HA capability was implemented. Using a unique FSID ensures that the volume the server presents is always identified the same way, and it makes it easier for NFS clients to reconnect and resume operations after an NFS server is rebooted.

If you are using FSID to define a unique filesystem ID for each mount point, you must use the same FSID in every export entry for a given volume. It’s the “file system identifier”, so it needs to be the same each time a single filesystem is exported. If I want to identify /opt/dir1 as fsid=1, I have to include that declaration in the options every time that filesystem is exported. So for the examples above:

/opt/dir1 192.168.1.0/24(rw,sync,fsid=1)
/opt/dir1 10.10.0.0/22(rw,sync,fsid=1)

Or:

/opt/dir1 192.168.1.0/24(rw,sync,fsid=1) 10.10.0.0/22(rw,sync,fsid=1)

If you use a different FSID for one of these entries, or if you declare the FSID for one subnet and not the other, your NFS server will slowly and mysteriously grind to a halt, sometimes over hours and sometimes over days.

For NFSv4, there is the concept of a “distinguished filesystem” which is the root of all exported filesystems. This is specified with fsid=root or fsid=0, both of which mean exactly the same thing. Unless you have a good reason to create a distinguished filesystem, don’t use fsid=0; it will just add unnecessary complexity to your NFS setup.
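
If you do have a reason to use one, a distinguished-filesystem layout looks roughly like this (/export is a hypothetical pseudo-root; NFSv4 clients then mount paths relative to it, e.g. nfs-server:/dir1 rather than the full path):

/export       192.168.1.0/24(rw,sync,fsid=0)
/export/dir1  192.168.1.0/24(rw,sync,fsid=1)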

Hope you find this useful.

post

Updating ESXi root passwords and authorized ssh keys with Ansible

I manage a number of vCenter instances and a lot of ESXi hosts. Some of the hosts are production, some are for test and development. Sometimes an ESXi host needs to be used by a different group or temporarily moved to a new cluster and then back again afterwards.

To automate the configuration of these systems and the VMs running on them I use Ansible. For a freshly-imaged, new installation of ESXi, one of the first things I do is run an Ansible playbook that sets up the ESXi host. The first thing that playbook does is install the ssh keys of the people who need to log in as root; then it updates the root password.

I have ssh public keys for every user that needs root access. A short bash script combines those keys and my Ansible management public key into authorized_keys files for the ESXi hosts in each vCenter instance. In my Ansible group_vars/ directory there is a file for each group of ESXi hosts, so all of the ESXi hosts in a group get the same root password and ssh keys. This also makes it easy to change root passwords and to add or remove users' ssh keys as they join or leave different groups.
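
The key-combining script itself is trivial. A sketch of the idea, using hypothetical file and directory names (one .pub file per user under keys/ops/, plus the Ansible management key), with the output dropped into the role's files/ directory:

#!/bin/bash
# Build the authorized_keys file for the Ops group of ESXi hosts.
cat keys/ansible-mgmt.pub keys/ops/*.pub > files/authkeys-ops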

Here’s a portion of a group_vars/esxi_hosts_cicd/credentials.yml file for a production CICD cluster:

# ESXI Hosts (only Ops can ssh in)
esxi_root_authorized_keys_file: authkeys-ops

esxi_username: 'root'
esxi_password: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          34633832366431383630653735663739636466316262
          39363165663566323864373930386239380085373464
          32383863366463653365383533646437656664376365
          31623564336165626162616263613166643462356462
          34633832366431383630653735663739636466316262
          39363165663566323864373930386239380085373464
          32383863366463653365383533646437656664376365
          31623564336165626162616263613166643462356462
          3061

The password is encrypted using Ansible Vault.
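
If you need to generate one of these encrypted values yourself, ansible-vault will emit the YAML for you; something along these lines (the plaintext password here is obviously a placeholder):

ansible-vault encrypt_string \
    --vault-id ~/path/to/vault/private/key/file \
    'new-root-password' \
    --name 'esxi_password'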

In my main.yml file I call the esxi_host role for all of the hosts in the esxi_hosts inventory group. Since I use a different user to manage non-ESXi hosts, the play that calls the role tells Ansible to use the root user only when logging into ESXi hosts.

- name: Setup esxi_hosts
  gather_facts: False
  user: root
  hosts: esxi_hosts
  roles:
    - esxi_host

The esxi_host role has an esxi_host/tasks/main.yml file. The two tasks that update the authorized_keys file and the root password look like this:

- name: Set the authorized ssh keys for the root user
  copy:
    src: "{{ esxi_root_authorized_keys_file }}"
    dest: /etc/ssh/keys-root/authorized_keys
    owner: root
    group: root
    mode: '0600'

- name: Set the root password for ESXI Hosts
  shell: "echo '{{ esxi_password }}' | passwd -s"
  no_log: True

The first time I run this the root password is still set to some other value (not the one in the vault), so I start Ansible with:

ansible-playbook main.yml \
    --vault-id ~/path/to/vault/private/key/file \
    -i inventory/ \
    --limit [comma-separated list of new esxi hosts] \
    --ask-pass \
    --ask-become-pass

This will prompt me for the current root password. Once I enter that, Ansible logs into each ESXi host, installs the new authorized_keys file, uses the vault private key to decrypt the password, then updates the root password.

After I’ve done this once, since the Ansible ssh key is also part of the authorized_keys file, subsequent Ansible runs just use the ssh key to log in, and I don’t need the --ask-pass or --ask-become-pass parameters.
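
After that, a routine run looks like this:

ansible-playbook main.yml \
    --vault-id ~/path/to/vault/private/key/file \
    -i inventory/ \
    --limit [comma-separated list of esxi hosts]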

This is also handy when switching a host from one cluster to another. As long as the ssh keys are installed I no longer need the current root password to update the root password.

Hope you find this useful.

post

Allow ping from specific subnets to AWS EC2 instances using Terraform

If you’re using Terraform to set up EC2 instances on AWS you may be a little confused about how to allow ping through the AWS VPC firewall, especially if you want to limit ping so that it only works from specific IPs or subnets.

To do this just add a Terraform ingress security group rule to the aws_security_group:

ingress {
  cidr_blocks = ["1.2.3.4/32"]
  from_port   = 8
  to_port     = 0
  protocol    = "icmp"
  description = "Allow ping from 1.2.3.4"
}

The above rule will only allow ping from the single IPv4 address “1.2.3.4”. You can use the cidr_blocks setting to allow ping from any set of IPv4 addresses and subnets that you wish. If you want to allow IPv6 addresses, use the ipv6_cidr_blocks setting:

ingress {
  cidr_blocks       = ["1.2.3.4/32"]
  ipv6_cidr_blocks  = [aws_vpc.example.ipv6_cidr_block]
  from_port         = 8
  to_port           = 0
  protocol          = "icmp"
  description       = "Allow ping from 1.2.3.4 and the example.ipv6_cidr_block"
}

Right about now you should be scratching your head and asking why a port range is specified from port 8 to port 0. Isn’t that backwards? Also, this is ICMP, so why are we specifying port ranges at all?

Well, for ICMP security group rules Terraform uses the from_port field to define the ICMP message type, and “ping” is an ICMP “echo request” type 8 message.

So why is to_port = 0? Since ICMP is a network-layer protocol there is no TCP or UDP port number associated with ICMP packets; port numbers belong to the transport layer, which sits above the network layer. So you might think to_port is set to 0 because it’s a “don’t care” setting, but that is not the case.

It’s actually set to 0 because Terraform (and AWS) use the to_port field to define the ICMP code of the ICMP packet being allowed through the firewall, and “ping” is defined as a type 8, code 0 ICMP message.
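
To see the pattern with a different message, here's a sketch of a rule allowing ICMP “destination unreachable” messages with the “fragmentation needed” code (type 3, code 4, the packets path-MTU discovery depends on) from the same address:

ingress {
  cidr_blocks = ["1.2.3.4/32"]
  from_port   = 3    # ICMP message type 3: destination unreachable
  to_port     = 4    # ICMP code 4: fragmentation needed
  protocol    = "icmp"
  description = "Allow PMTUD messages from 1.2.3.4"
}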

I have no idea why Terraform chose to obscure the usage this way, but I suspect it’s because the AWS API reuses the from_port field for storing the ICMP message type and reuses the to_port field for storing the ICMP code, and Terraform just copied their bad design. A more user-friendly implementation would have provided icmp_message_type and icmp_message_code fields (or aliases) that map to the AWS from_port and to_port fields, making it obvious what you’re setting and why it works.

Hope you find this useful.