Making JIRA Suck Less

Posted on 2025-01-08 by Earl C. Ruby III

Why JIRA Sucks

JIRA is almost universally reviled by every engineer that I know. Most of them can’t quite explain why it sucks, they just hate it.

In my view the problem isn’t JIRA, it’s how managers implement JIRA’s features that causes so much suckage. Here’s a short list of the problems that I see every time I start at a new company:

JIRA’s default settings suck. Most people start with the default settings and try to build on top of them, extending them as they go along. If you build on a foundation that sucks, whatever you build is also going to suck. Don’t use the defaults.
Managers try to implement overly-complex workflows. I’ve been successful using five status states. You might need four or six. You don’t need 17. Use the minimum number of status values required to express the states you need to track.
Managers try to implement workflows that require changing a ticket’s owner when the status changes. e.g. Bob finished the coding, now we need to assign it to Carol and Ted for code review, then Alice needs to write the test plan and QA the code, and finally Vladimir needs to sign off on QA and generate a release build before the ticket is complete. Later the VP of Engineering gives Vladimir a huge bonus because he’s the only one completing any tickets, and the fact that Bob & Carol & Ted & Alice worked hard on that ticket has been lost unless you manually check the ticket’s history. There’s no way to get a list of the tickets that Bob worked on, or Carol, or Ted, or Alice. A single ticket should be assigned to a single person, worked to completion, and closed.
Managers add values for priorities, resolutions, status, and other fields without documenting how they’re supposed to be used or training their staff how they are supposed to use them. Use the absolute minimum number that you need, make sure they’re self-explanatory, then still train your staff on how to use them.
Managers limit which status states can transition to other status states, frustrating end-users. Allow every status to transition to any other status.
Managers use generic names for priorities, resolutions, and status fields that are meaning-free, or use multiple names that have almost identical meanings. Do I “close” a ticket or “resolve” it? Which priority is higher, “urgent” or “critical”? Use the minimum number of values that you can, and make the choices self-explanatory.
No one cleans up their shit. If a manager adds a new field for a poorly-thought out project to track something or other on JIRA tickets, and then abandons that effort after a month, you can bet that engineers will still be prompted to enter a value for that field 4 years later. Resist the temptation to add more fields to JIRA, clean up after yourself when you give in to that temptation, and don’t be afraid to delete data.

Making JIRA suck less

Many years ago I was working for a startup that got bought by a large disk drive manufacturer. I was doing R&D work on large-scale object storage and as we were launching a new project I was told we needed to use JIRA to manage our task workloads. I wanted to dive in and start documenting all of the tasks that we needed to complete to get the project started, but one of the older engineers stopped me. He said we needed to meet first to discuss the configuration of JIRA for this project.

I was very reluctant to do this. My boss was asking me to get the tasks entered so we could start planning schedules and assigning work, but this guy was more experienced with JIRA, so I met with him first. Afterwards I was glad I waited. I’ve applied what I learned at every company since then and made JIRA suck less at all of them. I documented what I did, and many managers I worked for have reached out to me after we parted ways, asking for a copy of my “JIRA document” so they could apply it at their new jobs. This is that document.

JIRA Workflow

Goals for JIRA

These are the things we want to accomplish using JIRA.

Track all of the tasks engineers are working on.
Be able to report on what tasks are necessary to fix a bug, finish a release, complete a feature, complete a project, or are related to a specific portion of the software stack.
Make using JIRA frictionless by having few, concise, clear values for ticket fields so that an engineer never has to wonder what value a JIRA field should have.
Have a clear set of goals for each Sprint.
Get better at estimating how much work we can get done within a period of time.
Make sure that bugs are being fixed in a reasonable period of time.
Use JIRA’s automated reports and dashboards as a way to communicate back to PMs, Sales, Execs, and Engineers how much progress has been made towards delivering the features they are specifically interested in.
Use JIRA’s automated dashboards to forecast how close we are to completing major deliverables.
Make sure that the tradeoffs that need to be made when goals are changed are clear to PMs, Sales, Execs, and Engineers.

Ticket Scope

A ticket should describe a single task that can be done by a single person within a 2 week period. A ticket that requires one person for longer than that should be broken into separate tickets.

If a ticket requires QA, documentation, or another related but independent task, create another ticket for that task and link the two tickets. Tickets can be linked across projects.

Do not create sub-task tickets under a ticket, just create more tickets. Sub-tasks have limitations on tracking, reporting, and cannot be part of a different project. Don’t use them.

If a project is truly huge with many moving parts, create an Epic and put the tickets in the Epic.

JIRA Fields

Ticket Status

There are five ticket statuses:

Backlog
- All tickets start out as Backlog.
- New tickets are not assigned to anyone and are not scheduled.
- To schedule a ticket, assign it to someone and change the status to Selected for Development.
Selected for Development – Work has been defined, assigned, and scheduled but not started.
In Progress – The ticket is actively being worked on by the person it’s assigned to.
In Review – Assignee has completed all tasks, is waiting for reviewers to complete their reviews, or is waiting for blocking items to be completed, e.g. QA tasks, other engineering work, requestor sign-off. Blocking items should have their own tickets and be linked to the tickets that they block.
Closed – Ticket has been completed. A “Status: Closed” ticket has a small set of possible resolution values:
- Done
- Duplicate
- Rejected (Won’t Do)
- Cannot Reproduce

In order to reduce the amount of task-switching and improve focus each engineer should have no more than 3 – 5 tickets actively being worked on, “Selected for Development” and “In Progress,” combined.
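The limit above is easy to check automatically. This is a minimal sketch, assuming tickets have already been exported as plain dictionaries with hypothetical "assignee" and "status" keys; it is not a JIRA API call, just the counting rule.

```python
from collections import Counter

# Statuses that count as "actively being worked on" in the rule above.
ACTIVE_STATUSES = {"Selected for Development", "In Progress"}
WIP_LIMIT = 5  # upper end of the 3 - 5 range


def over_wip_limit(tickets, limit=WIP_LIMIT):
    """Return {assignee: active_ticket_count} for anyone above the limit."""
    counts = Counter(
        t["assignee"]
        for t in tickets
        if t["status"] in ACTIVE_STATUSES and t["assignee"]
    )
    return {name: n for name, n in counts.items() if n > limit}


# Hypothetical sample data: bob has 6 active tickets, carol has none.
tickets = (
    [{"assignee": "bob", "status": "In Progress"}] * 4
    + [{"assignee": "bob", "status": "Selected for Development"}] * 2
    + [{"assignee": "carol", "status": "In Review"}] * 9
)
print(over_wip_limit(tickets))
```

Tickets that are In Review do not count against the limit here, which matches the wording of the rule.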

Using this method and a 4-column Kanban board based on Ticket Status (omit backlog) every engineer and manager can see at a glance what needs to be done this week (Selected for Development), what’s in progress, what’s being reviewed, and what has been done.

Transitions should be defined from every state to every state. If someone wants to drag a ticket from the “Selected for Development” column and drop it in the “Closed” column they should be able to do that.

Transitions to Closed should prompt the user to fill in the resolution state.
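As a rough model of the rules in this section: five statuses, every transition allowed, and a resolution required only when closing. The dictionary-based ticket and the choice to clear the resolution on reopen are assumptions made for illustration, not JIRA behavior.

```python
from enum import Enum


class Status(Enum):
    BACKLOG = "Backlog"
    SELECTED = "Selected for Development"
    IN_PROGRESS = "In Progress"
    IN_REVIEW = "In Review"
    CLOSED = "Closed"


RESOLUTIONS = {"Done", "Duplicate", "Rejected (Won't Do)", "Cannot Reproduce"}


def transition(ticket, new_status, resolution=None):
    """Move a ticket to any status; only closing has a precondition."""
    if new_status is Status.CLOSED:
        if resolution not in RESOLUTIONS:
            raise ValueError("closing a ticket requires a valid resolution")
        ticket["resolution"] = resolution
    else:
        # Assumption: a ticket that leaves Closed loses its resolution.
        ticket["resolution"] = None
    ticket["status"] = new_status
    return ticket


ticket = {"status": Status.SELECTED, "resolution": None}
transition(ticket, Status.CLOSED, "Duplicate")  # straight to Closed is fine
print(ticket["status"].value, ticket["resolution"])
```

There is deliberately no table of allowed transitions: the only check is the one the workflow actually needs.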

Assignee

The person who will be doing the work to complete the ticket. Usually this is set when the ticket is Selected for Development and doesn’t change.

Reporter

The name of the person reporting the problem or requesting the feature.

Components

Each ticket has one or more Components. A component is a limited set of fixed categories which define which group has the primary responsibility for the ticket. Components are usually named after a product or service being developed. End-users should not be able to create new components.

A ticket with no assigned owner may be automatically assigned to the lead person responsible for that component.

Components are mostly used for reporting, to see how much backlog remains to be done for a given software product or service. By keeping the number of components limited to a small set of categories they become useful for reporting, running queries, or building dashboards.

Labels

Labels are “free form” and can be used to tag a ticket so that it’s included in specific reports.

Now if a manager has something that they want to report on across multiple tickets, rather than adding another field they can just add a label and generate reports and make queries based on that label. When they lose interest a month later they can stop using that label, without forcing engineers to fill in extra, unnecessary fields for years to come.
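A label-based report is usually just a saved JQL query. The helper below is a small sketch that builds such a query string; the project key and label name are made up, and the status name "Closed" assumes the five-status workflow described above.

```python
def jql_for_label(label, project=None, open_only=True):
    """Build a JQL query string for tickets carrying one label."""
    clauses = []
    if project:
        clauses.append(f'project = "{project}"')
    clauses.append(f'labels = "{label}"')
    if open_only:
        clauses.append('status != "Closed"')
    return " AND ".join(clauses) + " ORDER BY priority DESC"


# Hypothetical project key and label.
print(jql_for_label("q3-perf-audit", project="STOR"))
```

When the reporting effort ends, the query and the label can be dropped without touching any ticket fields.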

Customer

Name of the customer (if any) who requested this task or reported this bug. This is so you can follow up with the customer afterwards to let them know the issue was fixed, or know who to ask if there isn’t enough information given to resolve the task.

Issue Types

Task – Something that needs to be done.
Bug – Something that needs to be fixed.
Epic – A collection of tasks and/or bugs needed to complete a feature.

Remove any other issue types included as JIRA defaults.

Need a “Story”? A Story is just a loosely-defined Epic that’s in a Backlog state. You don’t need a “Story” type.

Need a “Feature”? A feature is either a Task or an Epic with multiple Tasks in it. You don’t need a separate “Feature” issue type.

Task, Bug, Epic. That’s it. Keep choices to a minimum so end users don’t have to think about what to use.

Affects Version / Fixed Version

Affects Version – For a bug, the version or versions affected by the bug.
Fixed Version – The version (first release) or versions (first release, edge release, maintenance release) where the fixed bug or completed task appears or is scheduled to appear.

Versions are used when it comes time to issue a release. You can easily see what work needs to be done to complete all of the tasks in a release. You can generate a change log of all of the changes that are in a release.
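To illustrate how a change log falls out of the Fixed Version field, here is a minimal sketch that works on exported tickets. The dictionary keys ("key", "type", "summary", "resolution", "fix_versions") and the sample tickets are hypothetical.

```python
from collections import defaultdict


def change_log(tickets, version):
    """List tickets resolved as Done in one Fixed Version, grouped by type."""
    by_type = defaultdict(list)
    for t in tickets:
        if version in t["fix_versions"] and t["resolution"] == "Done":
            by_type[t["type"]].append(f'{t["key"]}: {t["summary"]}')
    lines = [f"Release {version}"]
    for issue_type in ("Bug", "Task"):
        if by_type[issue_type]:
            lines.append(f"{issue_type}s:")
            lines.extend(f"  - {entry}" for entry in sorted(by_type[issue_type]))
    return "\n".join(lines)


tickets = [
    {"key": "STOR-1", "type": "Bug", "summary": "Fix replication stall",
     "resolution": "Done", "fix_versions": ["1.2.0"]},
    {"key": "STOR-2", "type": "Task", "summary": "Add bucket quotas",
     "resolution": "Done", "fix_versions": ["1.2.0"]},
    {"key": "STOR-3", "type": "Bug", "summary": "Stall on replication",
     "resolution": "Duplicate", "fix_versions": ["1.2.0"]},
]
print(change_log(tickets, "1.2.0"))
```

Duplicates and rejected tickets are left out because only the Done resolution represents a change that shipped.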

Priorities

By default JIRA assigns a default priority to tickets. After a while you have to wonder, is this ticket really a “High” priority or is it a “High” priority because no one changed the default when the ticket was created? To avoid this, make your tickets start with priority “None”. Now it’s clear that no priority was assigned.

At my company, if the filer doesn’t set the ticket priority, the product manager, engineering manager, or team lead sets it. If there is disagreement they can have a discussion to determine the correct priority, and the managers make the final decision. If the managers cannot reach agreement the VP of engineering breaks the tie. I don’t think it’s ever gotten to the VP.

Valid priorities are:

P0/Blocker – Drop everything else and fix this
P1/Critical – Important and urgent
P2/High – High priority. Important or urgent, not both
P3/Low – Low priority. Not important or urgent, but should get done as time permits
None – Priority has not been determined (default)

However you assign priorities to tickets at your company, define the process and let people know what it is.
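The value of an explicit “None” default shows up when you sort or filter tickets. The sketch below models the priority list above; the ticket dictionaries and the numeric value chosen for NONE are assumptions for illustration.

```python
from enum import IntEnum


class Priority(IntEnum):
    P0_BLOCKER = 0
    P1_CRITICAL = 1
    P2_HIGH = 2
    P3_LOW = 3
    NONE = 99  # not yet triaged; the default for new tickets


def needs_triage(tickets):
    """Keys of tickets whose priority nobody has set yet."""
    return [t["key"] for t in tickets
            if t.get("priority", Priority.NONE) is Priority.NONE]


def work_order(tickets):
    """Keys of triaged tickets, most urgent first."""
    triaged = [t for t in tickets
               if t.get("priority", Priority.NONE) is not Priority.NONE]
    return [t["key"] for t in sorted(triaged, key=lambda t: t["priority"])]


tickets = [
    {"key": "STOR-10", "priority": Priority.P2_HIGH},
    {"key": "STOR-11"},  # filer did not set a priority
    {"key": "STOR-12", "priority": Priority.P0_BLOCKER},
]
print(work_order(tickets), needs_triage(tickets))
```

Untriaged tickets never sneak into the work order under a misleading default; they show up on their own list for a manager to look at.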

Description

In the description of every ticket the filer has to include a “Definition of Done” (DoD): a statement of what the system behavior will look like once the problem is fixed. This is very important, since the person filing the ticket often has a very different expectation of what “done” looks like than the person completing the ticket. This mismatch can occur in both directions: sometimes the person doing the work does far less than what was needed, and sometimes they do far more, turning a two-hour task into a two-week project.

If a ticket is assigned to someone with no DoD the assignee should ask the reporter to add a DoD.

Due Date

If we promised a customer or anyone that a task would be completed by a specific date, then fill in the due date. Otherwise leave it blank.

Additional Fields

You may need some additional fields for your workflow. Some people like to track story points or effort required per ticket. If you need them, add them, just try to keep the number of fields that someone needs to fill in to the absolute minimum.

If you later find you’re not using a field, delete it.

Retrospective

Meeting every two weeks to discuss completed tasks, uncompleted tasks, what went well and what could have been done better over the past two weeks.

Stand Ups

When you do a stand-up and have the Kanban board for the group displayed on a large monitor, then the stand-up only needs to cover two questions:

Is the board correct?
Do you have any blockers?

Engineers appreciate brief standups. Make sure you’re tracking the right things, make sure that everyone has what they need to get things done that day. Standup completed.

Git

If a branch is created for a JIRA ticket put the branch name in the ticket. If an MR has an associated JIRA ticket you should be able to find the JIRA ticket from the MR or the MR from the JIRA ticket.

Both GitHub and GitLab have JIRA plugins that can post updates to JIRA tickets with a link to the MR, test pass/fail status, merge status, reviewer comments, etc. They can even automatically close tickets when an MR merges. Use these plugins to automate workflows and reduce time spent by engineers managing their JIRA tickets.
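Linking works best when the ticket key appears in the branch name or MR title. The sketch below pulls JIRA-style keys out of such strings; the pattern assumes the common "PROJECT-123" key shape with an uppercase project key, and the branch names are invented.

```python
import re

# Uppercase project key, a hyphen, then digits; not preceded by another
# letter or digit and not followed by more digits.
TICKET_KEY = re.compile(r"(?<![A-Za-z0-9])([A-Z][A-Z0-9]+-\d+)(?!\d)")


def ticket_keys(text):
    """Return JIRA-style keys found in a branch name or MR title."""
    return TICKET_KEY.findall(text)


print(ticket_keys("feature/STOR-123-fix-replication"))
print(ticket_keys("Fix STOR-7 and QA-12_followup"))
print(ticket_keys("no-ticket-here"))
```

A check like this can run in CI to warn when a branch has no ticket key at all.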

Summary

Keep the number of JIRA fields to fill in to a minimum.
Keep the workflow simple.
Don’t make people think about what they need to do — make it obvious.
Document your workflow.
Make sure that end users know how to use the system.
Automate required reports and dashboards.

I have applied these rules at three startups and several pretty large companies since I first wrote them down. Hopefully you can use some of these lessons at your company, because you may be required to use JIRA, but it doesn’t have to suck.

Copy entire file directories from a Linux host to Box

Posted on 2024-09-12 by Earl C. Ruby III

I had about 2TB of files on a cloud-based Linux host that I needed to back up to cloud storage. I had a Box Enterprise storage account with a 30PB limit on storage and a maximum file size of 150GB, so I decided to try to connect from Linux to Box and store all of the backup data in Box. You can check your own limits at the bottom of the Box “Account Settings” page.

The most difficult part of getting Box to work on a headless, cloud-based Linux host is getting authorization to work. Box wants to use OAuth2 web-based authentication, and I needed to set up Box access on a remote host where I’m connecting via ssh and there is no web browser or desktop. The easiest way that I’ve found to do this is to generate an OAuth2 bearer token on my laptop, which rclone prints as a small JSON object, and then copy that to the Linux host.

I used rclone for the backup. I first installed rclone on the Ubuntu Linux host:

sudo apt-get install rclone

Then I installed rclone on my Mac laptop:

brew install rclone

I pulled up a terminal on the Mac and configured rclone so that my Box account was authorized:

rclone authorize box

This will cause a browser window to pop up and ask you to log into Box. Once you’ve logged in and authorized rclone to read and write files in your Box drive the command will finish up and spit out a bearer token:

$ rclone authorize box
2024/09/12 08:57:15 NOTICE: Config file "/Users/eruby/.config/rclone/rclone.conf" not found - using defaults
2024/09/12 08:57:15 NOTICE: If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth?state=qqf7swGNZ8pH4iJksvR3xA
2024/09/12 08:57:15 NOTICE: Log in and authorize rclone for access
2024/09/12 08:57:15 NOTICE: Waiting for code…
2024/09/12 08:57:45 NOTICE: Got code
Paste the following into your remote machine --->
{"access_token":"nOtaReaLacCessT0k3NsuCkas","token_type":"bearer","refresh_token":"nOtaReaLb3aR3rT0k3NKxRrg2J1rB7DKzKg6svazAlwAwHWKl","expiry":"2024-09-12T10:02:31.314087-07:00"}
<---End paste

The bearer token contains an access token, a refresh token, and an expiration date. The access token is good for an hour. After that it expires and the application (rclone) will use the refresh token to get a new access token. This can keep happening until the user that generated the token (you) is no longer allowed to access Box or your password changes. If you change your password you’ll need to generate a new bearer token. The refresh token may expire before your password changes, depending on the security policy of the organization issuing the refresh token, so at some point you may need to regenerate the bearer token even if you don’t change your password.
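If you want to inspect a token before pasting it, the JSON is easy to read with a few lines of Python. This is a sketch, not part of rclone; the token values below are placeholders, and only the field names come from the output shown above.

```python
import json
from datetime import datetime, timezone


def token_status(token_json, now=None):
    """Summarize an rclone-style OAuth2 token without printing its secrets."""
    token = json.loads(token_json)
    # Python 3.11+ parses any fraction length here; older versions
    # expect exactly 3 or 6 fractional digits.
    expiry = datetime.fromisoformat(token["expiry"])
    now = now or datetime.now(timezone.utc)
    return {
        "token_type": token["token_type"],
        "has_refresh_token": bool(token.get("refresh_token")),
        "access_token_expired": now >= expiry,
    }


# Placeholder token with the same shape as the rclone output.
sample = json.dumps({
    "access_token": "PLACEHOLDER",
    "token_type": "bearer",
    "refresh_token": "PLACEHOLDER",
    "expiry": "2024-09-12T10:02:31.314087-07:00",
})
print(token_status(sample))
```

An expired access token is not a problem by itself as long as a refresh token is present, since rclone uses it to fetch a new access token.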

Now you just have to paste the bearer token into the Linux host’s rclone config, so log into the Linux host and run rclone config. Here’s the entire interaction with the config command:

$ rclone config
2024/09/12 18:19:19 NOTICE: Config file "/home/eruby/.config/rclone/rclone.conf" not found - using defaults
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> Box
Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / 1Fichier
   \ "fichier"
 2 / Alias for an existing remote
   \ "alias"
 3 / Amazon Drive
   \ "amazon cloud drive"
 4 / Amazon S3 Compliant Storage Provider (AWS, Alibaba, Ceph, Digital Ocean, Dreamhost, IBM COS, Minio, Tencent COS, etc)
   \ "s3"
 5 / Backblaze B2
   \ "b2"
 6 / Box
   \ "box"
 7 / Cache a remote
   \ "cache"
 8 / Citrix Sharefile
   \ "sharefile"
 9 / Dropbox
   \ "dropbox"
10 / Encrypt/Decrypt a remote
   \ "crypt"
11 / FTP Connection
   \ "ftp"
12 / Google Cloud Storage (this is not Google Drive)
   \ "google cloud storage"
13 / Google Drive
   \ "drive"
14 / Google Photos
   \ "google photos"
15 / Hubic
   \ "hubic"
16 / In memory object storage system.
   \ "memory"
17 / Jottacloud
   \ "jottacloud"
18 / Koofr
   \ "koofr"
19 / Local Disk
   \ "local"
20 / Mail.ru Cloud
   \ "mailru"
21 / Microsoft Azure Blob Storage
   \ "azureblob"
22 / Microsoft OneDrive
   \ "onedrive"
23 / OpenDrive
   \ "opendrive"
24 / OpenStack Swift (Rackspace Cloud Files, Memset Memstore, OVH)
   \ "swift"
25 / Pcloud
   \ "pcloud"
26 / Put.io
   \ "putio"
27 / SSH/SFTP Connection
   \ "sftp"
28 / Sugarsync
   \ "sugarsync"
29 / Transparently chunk/split large files
   \ "chunker"
30 / Union merges the contents of several upstream fs
   \ "union"
31 / Webdav
   \ "webdav"
32 / Yandex Disk
   \ "yandex"
33 / http Connection
   \ "http"
34 / premiumize.me
   \ "premiumizeme"
35 / seafile
   \ "seafile"
Storage> 6
** See help for box backend at: https://rclone.org/box/ **

OAuth Client Id
Leave blank normally.
Enter a string value. Press Enter for the default ("").
client_id>
OAuth Client Secret
Leave blank normally.
Enter a string value. Press Enter for the default ("").
client_secret>
Box App config.json location
Leave blank normally.

Leading `~` will be expanded in the file name as will environment variables such as `${RCLONE_CONFIG_DIR}`.

Enter a string value. Press Enter for the default ("").
box_config_file>
Box App Primary Access Token
Leave blank normally.
Enter a string value. Press Enter for the default ("").
access_token>

Enter a string value. Press Enter for the default ("user").
Choose a number from below, or type in your own value
 1 / Rclone should act on behalf of a user
   \ "user"
 2 / Rclone should act on behalf of a service account
   \ "enterprise"
box_sub_type>
Edit advanced config? (y/n)
y) Yes
n) No (default)
y/n> n
Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes (default)
n) No
y/n> n
For this to work, you will need rclone available on a machine that has
a web browser available.

For more help and alternate methods see: https://rclone.org/remote_setup/

Execute the following on the machine with the web browser (same rclone
version recommended):

	rclone authorize "box"

Then paste the result below:
result> {"access_token":"nOtaReaLacCessT0k3NsuCkas","token_type":"bearer","refresh_token":"nOtaReaLb3aR3rT0k3NKxRrg2J1rB7DKzKg6svazAlwAwHWKl","expiry":"2024-09-12T10:02:31.314087-07:00"}
--------------------
[Box]
token = {"access_token":"nOtaReaLacCessT0k3NsuCkas","token_type":"bearer","refresh_token":"nOtaReaLb3aR3rT0k3NKxRrg2J1rB7DKzKg6svazAlwAwHWKl","expiry":"2024-09-12T10:02:31.314087-07:00"}
--------------------
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:

Name                 Type
====                 ====
Box                  box

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q

To explain:

I named the remote connection “Box”.
I made the type of remote connection “box” (choice 6 from the list of supported remote storage types).
I didn’t edit any options, just answered “n” when asked if I wanted to make edits.
When I got to the Use auto config? question I answered “n”.
I copied the entire bearer token I got from my Mac into the result> field on the Linux box. (This is the entire contents inside the curly braces “{ … }” that rclone tells you to “Paste the following into your remote machine”.)

At this point rclone should be configured. Now I just have to copy the data into Box. All of the data is in the /testdata directory, so I ran:

rclone copy --copy-links \
    --create-empty-src-dirs \
    /testdata Box:backup-testdata \
    > ~/box-backup-testdata.log 2>&1 &

There are symlinks in the directory, so --copy-links is needed.
If there are empty directories on the source, I want to copy them over, so --create-empty-src-dirs is needed.
/testdata is the root directory that I’m copying over.
“Box” is the name I used for the remote connection.
backup-testdata is going to be the location of the data in my Box account.
Since this is going to take many hours to copy, I redirected all output to a log file and ran the process in the background. I can come back and check the log file later to see if everything worked.
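If you would rather launch the copy from a script than from the shell, the same command can be assembled in Python. This is a sketch under the assumption that rclone is on the PATH and the remote is named as above; the function names are my own.

```python
import subprocess
from pathlib import Path


def build_copy_command(source, remote, dest):
    """Assemble the same rclone copy invocation used above as an argv list."""
    return [
        "rclone", "copy",
        "--copy-links",             # follow symlinks in the source tree
        "--create-empty-src-dirs",  # recreate empty directories on the remote
        source,
        f"{remote}:{dest}",
    ]


def start_backup(source, remote, dest, log_path):
    """Start the copy in the background, sending all output to a log file."""
    log = open(Path(log_path).expanduser(), "wb")
    return subprocess.Popen(
        build_copy_command(source, remote, dest),
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # keep running if the launching terminal closes
    )


print(" ".join(build_copy_command("/testdata", "Box", "backup-testdata")))
```

Calling start_backup("/testdata", "Box", "backup-testdata", "~/box-backup-testdata.log") would start the transfer; the returned process object can be polled later to see whether it has finished.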

Just to make sure everything is working I logged into Box with my laptop and I can see the new backup-testdata directory and it’s being populated with data, so everything appears to be working.

Hope you find this useful.

Thank you Adam Ru for the suggestion to try rclone.

Fix bouncing mail from a GNU Mailman server on Dreamhost

Posted on 2024-07-20 by Earl C. Ruby III

Updated on 2025-03-07 with new Dreamhost Mailman mail server DNS name.

GNU Mailman is free software for managing electronic mail discussion and e-newsletter lists. I started using it back in 1998 for managing internal email lists at a company I worked for. I’ve used it many times over the years, but stopped when email lists fell out of fashion. I liked it because it’s pretty easy to set up an actual discussion list, where replies go to the list (not the sender), which results in actual discussion.

I recently set one up again, using my Dreamhost account and their automated web panel to deploy a discussion list for a volunteer group I manage. I was having a problem though: some of the people on the list weren’t getting all of the mail.

Mostly it was anyone with an @gmail.com mailing address. The odd part was that they were getting some messages, just not all messages. I had people check their spam folders but that wasn’t it.

Since messages weren’t ending up in spam folders, that usually means that (a) the recipient’s email server is bouncing (refusing) the message or (b) something is wrong with Mailman’s settings.

I did some Googling today and found that many other people were reporting similar problems, but no one had a good solution other than to turn on bounced message troubleshooting, so I did that.

I logged into the list’s mailing list administration page, selected the “Bounce processing” setup option, and made sure that all notifications were turned ON.

After I did that I sent a message to the mailing list. Almost immediately I got back a bounce message from sbcglobal.net:

<listsubscriber@sbcglobal.net>: host ff-ip4-mx-vip1.prodigy.net[144.160.159.21]
    said: 553 5.3.0 flpd577 DNSBL:RBL 521< [Mailman's IP] >_is_blocked.For
    assistance forward this error to abuse_rbl@abuse-att.net (in reply to MAIL FROM command)

DNSBL:RBL is a realtime DNS blacklist designed to block spam. I went to DNSBL.info and checked my Mailman server’s IP address. It wasn’t listed.

Next I went to check the DNS SPF record for the mailing list’s domain name. I had assumed that since I’d used Dreamhost’s web panel to install the Mailman service that Dreamhost would automatically take care of the SPF record.

I was wrong, there was no SPF record.

Well that explains a lot.

When a mail server (technically a “mail exchanger” or “MX” server) receives mail from another mail server one of the things that it will do is ask two questions:

What domain did this email come from?
Is the server that sent this mail allowed to send mail for that domain?

The second question is answered by an SPF record. The receiving mail server looks up the DNS SPF record for the domain that sent the mail. If the SPF record says that the server sending the mail is allowed to send mail for the domain the SPF check passes and all is well. If the SPF record doesn’t exist, or doesn’t list the server that the mail came from, the SPF check doesn’t pass and the receiving server may bounce the mail or treat it as spam.

Dreamhost installs Mailman on a subdomain. My Mailman subdomain name didn’t have an SPF record. I was somewhat surprised that any mail was getting through. Usually a missing SPF record will cause all mail coming from a domain to be bounced.

So I added an SPF record for my subdomain. In my case I allow-listed the following:

Any IP with an A record for my subdomain. The mailing list is on a subdomain with one A record that points to the VM running the Mailman server.
Any IP with an MX record for my subdomain, so any assigned mail exchangers.
netblocks.dreamhost.com and relay.mailchannels.net – Suggested by Dreamhost tech support. I’m guessing “all netblocks assigned to Dreamhost” and “all mail relays operated by Dreamhost.”
listserver-dab.dreamhost.com – added this on 2025-03-07 after Dreamhost changed the outgoing mail server for Mailman to listserver-dab.dreamhost.com.

The subdomain’s DNS entry is a type “TXT” record with the contents:

"v=spf1 a mx include:netblocks.dreamhost.com
   include:relay.mailchannels.net
   include:listserver-dab.dreamhost.com ~all"

The ~all at the end says that anyone attempting to send mail from my domain using a server that isn’t in the list will “soft fail” the SPF test, which is interpreted by most mail exchange servers to mean “mark it as spam if it doesn’t come from one of the listed hosts.” If you want the MX server to “hard fail” (bounce) the message use -all (hard fail) instead.
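To make the structure of the record concrete, here is a small sketch that splits an SPF record into its terms and names the result each qualifier produces. It is a simplification: it does no DNS lookups and treats modifiers such as redirect= like any other term.

```python
# Qualifier prefixes and the SPF result each one produces when its term matches.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def parse_spf(record):
    """Split an SPF record into (result, mechanism, value) tuples."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    parsed = []
    for term in terms[1:]:
        qualifier = "+"  # a term with no prefix means "pass"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        parsed.append((QUALIFIERS[qualifier], name.lower(), value))
    return parsed


record = ("v=spf1 a mx include:netblocks.dreamhost.com "
          "include:relay.mailchannels.net "
          "include:listserver-dab.dreamhost.com ~all")
for result, mechanism, value in parse_spf(record):
    print(f"{mechanism:8} {value:32} -> {result}")
```

Terms are evaluated left to right and the first match wins, so the final all term only applies to servers that matched nothing earlier in the record.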

I tend to use soft fail just in case the list subscriber’s server is misconfigured or there’s some other failure. In that case the MX server will send list messages to spam (so the list subscriber will still see it) rather than bounce the message.

If you need to set this up for yourself make sure that you list all hosts that send mail for your domain. There are a number of web tools available to help you create an SPF record with the correct parameters, just Google “create an spf record” and you’ll find half a dozen.

Hope you find this useful.

Quickly create guest VMs using virsh, cloud image files, and cloud-init

Posted on 2023-02-15 by Earl C. Ruby III

After the latest updates to the code these scripts now create VMs from full Linux distros in a few seconds.

I was looking for a way to automate the creation of VMs for testing various distributed system / cluster software packages. I’ve used Vagrant in the past but I wanted something that would:

Allow me to use raw image files as the basis for guest VMs.
Guest VMs should be set up with bridged IPs that are routable from the host.
Guest VMs should be able to reach the Internet.
Other hosts on the local network should be able to reach guest VMs. (Setting up additional routes is OK).
VM creation should work with any distro that supports cloud-init.
Scripts should be able to create and delete VMs in a scripted, fully-automatic manner.
Guest VMs should be set up to allow passwordless ssh access from the “ansible” user, so that once a VM is running Ansible can be used for additional configuration and customization of the VM.

I’ve previously used virsh’s virt-install tool to create VMs and I like how easy it is to set up things like extra network interfaces and attach existing disk images. The scripts in this repo fully automate the virsh VM creation process.

cloud-init

The current version of the create-vm script uses cloud images, cloud-init, and virsh tooling to quickly create VMs from the command line. Using a single Linux host system you can create multiple guest VMs running on that host. Each guest VM has its own file system, memory, virtualized CPUs, IP address, etc.

Cloud Images

create-vm creates a QCOW2 file for your VM’s file system. The QCOW2 image uses the cloud image as a base filesystem, so instead of copying all of the files that come with a Linux distribution and installing them, QCOW will just use files directly from the base image as long as those files remain unchanged. QCOW stands for “QEMU Copy On Write”, so once you make a change to a file the changes are written to your VM’s QCOW2 file.
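For readers who want to try the copy-on-write layering by hand, the overlay file can be created with qemu-img. The sketch below only builds the command line; the paths are hypothetical, the base image is assumed to be in QCOW2 format, and the create-vm script itself may do this differently.

```python
def qemu_img_create_command(base_image, vm_image, size):
    """Build a qemu-img command that creates a QCOW2 overlay on a base image."""
    return [
        "qemu-img", "create",
        "-f", "qcow2",     # format of the new overlay file
        "-F", "qcow2",     # format of the backing (base) image, assumed QCOW2
        "-b", base_image,  # unchanged files are read from here
        vm_image,
        size,              # maximum virtual disk size, e.g. "100G"
    ]


# Hypothetical paths following the directory layout used by create-vm.
cmd = qemu_img_create_command(
    "/home/earl/vms/base/base-cloud-image.img",
    "/home/earl/vms/images/your-vm-name.img",
    "100G",
)
print(" ".join(cmd))
```

Because the overlay only records changes, the base image must stay in place and unmodified for as long as any VM uses it.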

Cloud images have the extension .img or .qcow2 and are compiled for different system architectures.

Cloud images are available for the following distros:

Pick the base image for the distro and release that you want to install and download it onto your host system. Make sure that the base image uses the same hardware architecture as your host system, e.g. “x86_64” or “amd64” for Intel and AMD -based host systems, “arm64” for 64 bit ARM-based host systems.

cloud-init configuration

cloud-init reads in two configuration files, user-data and meta-data, to initialize a VM’s settings. One of the places it looks for these files is any attached disk volume labeled cidata.

The create-vm script creates an ISO disk called cidata with these two files and passes that in as a volume to virsh when it creates the VM. This is referred to as the “no cloud” method, so if you see a cloud image for “nocloud” that’s the one you want to use.
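To give a feel for what goes into those two files, here is a minimal sketch that renders them as text. The hostname, user name, and key are placeholders, and the real create-vm script may write different or additional settings.

```python
def render_meta_data(hostname):
    """Minimal meta-data: an instance ID and the hostname for the VM."""
    return f"instance-id: {hostname}\nlocal-hostname: {hostname}\n"


def render_user_data(hostname, user, ssh_public_key):
    """Minimal user-data: one sudo-capable user with key-based ssh access."""
    return "\n".join([
        "#cloud-config",  # required first line of a cloud-config file
        f"hostname: {hostname}",
        "users:",
        f"  - name: {user}",
        "    sudo: ALL=(ALL) NOPASSWD:ALL",
        "    shell: /bin/bash",
        "    ssh_authorized_keys:",
        f"      - {ssh_public_key}",
        "",
    ])


# Placeholder values; substitute your own hostname and public key.
print(render_meta_data("testvm1"))
print(render_user_data("testvm1", "ansible", "ssh-ed25519 AAAAexample ansible@host"))
```

Both files are then written to a small ISO image whose volume label is cidata, which is how cloud-init finds them on first boot.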

If you’re interested in other ways of doing this check out the Datasources documentation for cloud-init.

Files

create-vm stores files as follows:

${HOME}/vms/base/ – Place to store your base Linux cloud images.
${HOME}/vms/images/ – your-vm-name.img and your-vm-name-cidata.img files.
${HOME}/vms/init/ – user-data and meta-data.
${HOME}/vms/xml/ – Backup copies of your VMs’ XML definition files.

QCOW2 filesystems allocate space as needed, so if you create a VM with 100GB of storage, the initial size of the your-vm-name.img and your-vm-name-cidata.img files is only about 700K total. The your-vm-name.img file will grow as you install packages and update files, but will never grow beyond the disk size that you set when you create the VM.

Scripts

The create-vm repo contains these scripts:

create-vm – Use .img and cloud-init files to auto-generate a VM.
delete-vm – Delete a virtual machine created with create-vm.
get-vm-ip – Get the IP address of a VM managed by virsh.

Host setup

I’m running the scripts from a host with Ubuntu Linux 22.04 installed. I added the following to the host’s Ansible playbook to install the necessary virtualization packages:

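  - name: Install virtualization packages
    apt:
      # libvirt-bin and python-libvirt are left out: on Ubuntu 22.04 they
      # are no longer packaged, and libvirt-daemon-system, libvirt-clients,
      # and python3-libvirt cover the same ground.
      name:
        - libvirt-clients
        - libvirt-daemon
        - libvirt-daemon-system
        - libvirt-daemon-driver-storage-zfs
        - python3-libvirt
        - virt-manager
        - virtinst
      state: latest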

If you’re not using Ansible just apt-get install the above packages.

Permissions

The libvirtd daemon launches QEMU processes under the libvirt-qemu service account, so the libvirt-qemu user must be able to read the files in ${HOME}/vms/. If your ${HOME} directory has permissions set to 0750 then libvirt-qemu won’t be able to read the ${HOME}/vms/ directory.

You could open up your home directory, e.g.:

chmod 755 ${HOME}

… but that allows anyone logged into your Linux host to read everything in your home directory. A better approach is just to add libvirt-qemu to your home directory’s group. For instance, on my host my home directory is /home/earl, owned by user earl and group earl, permissions 0750:

$ chmod 750 /home/earl
$ ls -al /home
total 24
drwxr-xr-x   6 root      root      4096 Aug 28 21:26 .
drwxr-xr-x  21 root      root      4096 Aug 28 21:01 ..
drwxr-x--- 142 earl      earl      4096 Feb 16 09:27 earl

To make sure that only the libvirt-qemu user can read my files I can add the user to the earl group:

$ sudo usermod --append --groups earl libvirt-qemu
$ sudo systemctl restart libvirtd
$ grep libvirt-qemu /etc/group
earl:x:1000:libvirt-qemu
libvirt-qemu:x:64055:libvirt-qemu

That shows that the group earl, group ID 1000, has a member libvirt-qemu. Since the group earl has read and execute permissions on my home directory, libvirt-qemu has read and execute permissions on my home directory.
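If you want to check a directory before creating VMs, something like the following sketch works. The function name group_can_read_dir is made up for this example, and it assumes GNU stat (the default on Ubuntu). It only looks at the group permission bits; it does not check who is in the group.

```shell
#!/bin/bash
# Sketch only: group_can_read_dir is a hypothetical helper name.
# Returns 0 if the directory's group bits include both read and execute.
group_can_read_dir() {
    local mode group_bits
    # %a prints the octal mode, e.g. 750
    mode=$(stat -c '%a' "$1") || return 2
    # Shift off the "other" digit and keep the group digit
    group_bits=$(( (8#$mode >> 3) & 7 ))
    # read = 4, execute = 1, so both together = 5
    (( (group_bits & 5) == 5 ))
}

# Example:
#   group_can_read_dir "$HOME" && echo "group can read and traverse $HOME"
```

A directory with mode 750 or 755 passes this check; a directory with mode 700 does not.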

Note: The libvirtd daemon will chown some of the files in the directory, including the files in the ~/vms/images directory, to be owned by user libvirt-qemu, group kvm. In order to delete these files without sudo, add yourself to the kvm group, e.g.:

$ sudo usermod --append --groups kvm earl

You’ll need to log out and log in again before the additional group is active.

create-vm options

create-vm supports the following options:

OPTIONS:
   -h      Show this message
   -n      Host name (required)
   -i      Full path and name of the base .img file to use (required)
   -k      Full path and name of the ansible user's public key file (required)
   -r      RAM in MB (defaults to 2048)
   -c      Number of VCPUs (defaults to 2)
   -s      Amount of storage to allocate in GB (defaults to 80)
   -b      Bridge interface to use (defaults to virbr0)
   -m      MAC address to use (default is to use a randomly-generated MAC)
   -v      Verbose

Create an Ubuntu 22.04 server VM

This creates an Ubuntu 22.04 “Jammy Jellyfish” VM with a 40G hard drive.

First download a copy of the Ubuntu 22.04 “Jammy Jellyfish” cloud image:

mkdir -p ~/vms/base
cd ~/vms/base
wget http://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img

Then create the VM:

create-vm -n node1 \
    -i ~/vms/base/jammy-server-cloudimg-amd64.img \
    -k ~/.ssh/id_rsa_ansible.pub \
    -s 40

Once created I can get the IP address and ssh to the VM as the user “ansible”:

$ get-vm-ip node1
192.168.122.219
$ ssh -i ~/.ssh/id_rsa_ansible ansible@192.168.122.219
The authenticity of host '192.168.122.219 (192.168.122.219)' can't be established.
ED25519 key fingerprint is SHA256:L88LPO9iDCGbowuPucV5Lt7Yf+9kKelMzhfWaNlRDxk.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '192.168.122.219' (ED25519) to the list of known hosts.
Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-60-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Wed Feb 15 20:05:45 UTC 2023

  System load:  0.47216796875     Processes:             105
  Usage of /:   3.7% of 38.58GB   Users logged in:       0
  Memory usage: 9%                IPv4 address for ens3: 192.168.122.219
  Swap usage:   0%

Expanded Security Maintenance for Applications is not enabled.

0 updates can be applied immediately.

Enable ESM Apps to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status



The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

ansible@node1:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           198M 1008K  197M   1% /run
/dev/sda1        39G  1.5G   38G   4% /
tmpfs           988M     0  988M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
tmpfs           198M  4.0K  198M   1% /run/user/1000
ansible@node1:~$

Note that this VM was created with a 40GB hard disk, and df reports about 39G of total disk space, but the actual hard drive space initially used by this VM on the host was about 700K. The VM can consume up to 40GB, but will only use the space it actually needs.

Create 8 Ubuntu 22.04 servers

This starts the VM creation process and exits. Creation of the VMs continues in the background.

for n in `seq 1 8`; do
    create-vm -n node$n -i ~/vms/base/jammy-server-cloudimg-amd64.img -k ~/.ssh/id_rsa_ansible.pub
done

Delete 8 virtual machines

for n in `seq 1 8`; do
    delete-vm node$n
done

Connect to a VM via the console

virsh console node1

Connect to a VM via ssh

ssh ansible@$(get-vm-ip node1)

Generate an Ansible hosts file

(
    echo '[hosts]'
    for n in `seq 1 8`; do
        ip=$(get-vm-ip node$n)
        echo "node$n ansible_host=$ip ip=$ip ansible_user=ansible"
    done
) > hosts.ini

Handy virsh commands

virsh list – List all running VMs.

virsh domifaddr node1 – Get a node’s IP address. Does not work with all network setups, which is why I wrote the get-vm-ip script.

virsh net-list – Show what networks were created by virsh.

virsh net-dhcp-leases $network – Shows current DHCP leases when virsh is acting as the DHCP server. Leases may be shown for machines that no longer exist.

Hope you find this useful.

Determine maximum MTU

Posted on 2020-02-14 by Earl C. Ruby III

I first started paying attention to network MTU settings when I was building petabyte-scale object storage systems. Tuning the network that backs your storage requires maximizing the size of the data packets and verifying that packets aren’t being fragmented. Currently I’m working on performance tuning the processing of image data using racks of GPU servers, and the need to verify the network MTU came up again. I dug up a script I’d used before and thought I’d share it in case other people run into the same problem.

You can set the host network interface’s MTU setting to 9000 on all of the hosts in your network to enable jumbo frames, but how can you verify that the settings are working? If you’ve set up servers in a cloud environment using multiple availability zones or multiple regions, how can you verify that there isn’t a switch somewhere in the middle of your connection that doesn’t support MTU 9000 and fragments your packets?

Use this shell script:

#!/bin/bash

# Determine the maximum MTU between the current host and a remote host
# Code from https://earlruby.org/2020/02/determine-maximum-mtu/

# Usage: max-mtu.sh $target_host

if ! which ping > /dev/null 2>&1; then
    echo "ping is not installed"
    exit 1
fi

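target_host=$1
if [ -z "$target_host" ]; then
    echo "Usage: $0 target_host"
    exit 1
fi

# Starting ICMP payload size in bytes (1272 + 28 bytes of headers = 1300)
size=1272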

if ! ping -c1 $target_host >&/dev/null; then
   echo "$target_host does not respond to ping"
   exit 1
fi

if ping -s $size -M do -c1 $target_host >&/dev/null; then
   # GNU ping
   nofragment='-M do'
else
   # BSD ping
   nofragment='-D'
fi

while ping -s $size $nofragment -c1 $target_host >&/dev/null; do
    ((size+=4));
done
echo "Max MTU size to $target_host: $((size-4+28))"

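-s $size sets the size of the ICMP payload in bytes, not counting the headers.

-M do sets the “don’t fragment” flag on GNU ping, so the ping fails if the packet would have to be fragmented. BSD ping uses -D for the same thing.

-c1 sends 1 packet only.

size-4+28 = subtract the last 4 bytes added (the step that made the ping fail), then add 28 bytes for the IPv4 header (20 bytes) and the ICMP header (8 bytes).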
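The header arithmetic is easy to get backwards, so here it is as a small sketch. The function names are made up for this example, and the 28 bytes assume an IPv4 header with no options.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical.
# 20-byte IPv4 header (no options) + 8-byte ICMP header
IP_ICMP_OVERHEAD=28

# Largest ping payload (-s value) plus headers gives the MTU.
payload_to_mtu() {
    echo $(( $1 + IP_ICMP_OVERHEAD ))
}

# MTU minus headers gives the largest ping payload that fits.
mtu_to_payload() {
    echo $(( $1 - IP_ICMP_OVERHEAD ))
}
```

For example, payload_to_mtu 1472 prints 1500 (standard Ethernet) and mtu_to_payload 9000 prints 8972, which is the largest -s value that should get through a path that really supports jumbo frames.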

If minimizing packet fragmentation is important to you, set MTU to 9000 on all hosts and then run this test between every pair of hosts in the network. If you get an unexpectedly low value, troubleshoot your switch and host settings and fix the issue.
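Before testing between hosts it helps to confirm what each host’s interface is actually set to. On Linux the current value is exposed under /sys/class/net. The sketch below reads it from there; the function names and the SYS_NET_DIR variable are made up for this example, and eth0 is a placeholder for your interface name.

```shell
#!/bin/bash
# Sketch only: helper names and SYS_NET_DIR are hypothetical.
# SYS_NET_DIR can be overridden, which is handy for testing.
SYS_NET_DIR=${SYS_NET_DIR:-/sys/class/net}

# Print the configured MTU of an interface.
iface_mtu() {
    cat "$SYS_NET_DIR/$1/mtu"
}

# Compare an interface's MTU to the expected value.
# Returns 0 on match, 1 on mismatch, 2 if the interface can't be read.
check_iface_mtu() {
    local iface=$1 want=$2 have
    have=$(iface_mtu "$iface") || return 2
    if [ "$have" -eq "$want" ]; then
        echo "$iface: MTU $have"
    else
        echo "$iface: MTU $have, expected $want"
        return 1
    fi
}

# Example:
#   check_iface_mtu eth0 9000
```

This only tells you what the local interface is configured for. It says nothing about the switches in between, which is what the ping test is for.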

Assuming that all of your hosts and switches are configured at their maximum MTU values, and you run this script between every pair of hosts, then the minimum value returned from the script for all of your host-pairs is the actual maximum MTU you can support without fragmentation. Use the minimum value returned for all host-pairs as your new host interface MTU setting.
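Finding that minimum can be scripted. This sketch assumes the script above is saved as max-mtu.sh and relies on its “Max MTU size to …” output line; the function name min_mtu and the host names in the example are made up.

```shell
#!/bin/bash
# Sketch only: min_mtu is a hypothetical helper name.
# Reads max-mtu.sh output on stdin and prints the smallest MTU reported.
# Lines that don't start with "Max MTU size to " are ignored.
min_mtu() {
    awk '/^Max MTU size to / {
             v = $NF + 0
             if (min == "" || v < min) min = v
         }
         END { if (min != "") print min }'
}

# Example, run from one host against a list of peers:
#   for host in hostA hostB hostC; do
#       ./max-mtu.sh "$host"
#   done | min_mtu
```

To cover every pair you still need to run the loop from each host in turn, for instance over ssh, and take the smallest of the per-host results.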

If you’re operating in a cloud environment you may need to repeat this exercise from time to time as switches are changed and upgraded at your cloud provider.

Hope you find this useful.