I first started paying attention to network MTU settings when I was building petabyte-scale object storage systems. Tuning the network that backs your storage means maximizing the size of the data packets and verifying that packets aren’t being fragmented. These days I’m working on performance-tuning the processing of image data on racks of GPU servers, and the need to verify network MTU settings came up again. I dug up a script I’d used before and thought I’d share it in case other people run into the same problem.
You can set the network interface’s MTU to 9000 on all of the hosts in your network to enable jumbo frames, but how can you verify that the setting is actually in effect? If you’ve set up servers in a cloud environment using multiple availability zones or multiple regions, how can you verify that there isn’t a switch somewhere in the middle of your connection that doesn’t support MTU 9000 and fragments your packets?
Use this shell script:
#!/bin/bash
target_host=$1
size=1272    # assumed starting payload size; anything below your expected minimum MTU works
while ping -s $size -M do -c1 $target_host >&/dev/null; do ((size+=4)); done
echo "Max MTU size: $((size-4+28))"
-s $size sets the ICMP payload size of the packet being sent.
-M do prohibits fragmentation, so the ping fails if the packet would have to be fragmented.
-c1 sends 1 packet only.
size-4+28 backs off the last 4-byte increment (the one that caused fragmentation) and adds 28 bytes for the IP and ICMP headers, converting the largest payload that went through unfragmented into an MTU value.
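If you save the script as max_mtu.sh (the file name, address, and output here are just for illustration), run it from one host against another:

./max_mtu.sh 10.0.1.5
Max MTU size: 9000

A result of 9000 means jumbo frames are making it through unfragmented; a result like 1500 means something in the path is still running the default MTU.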
If minimizing packet fragmentation is important to you, set MTU to 9000 on all hosts and then run this test between every pair of hosts in the network. If you get an unexpectedly low value, troubleshoot your switch and host settings and fix the issue.
Assuming that all of your hosts and switches are configured at their maximum MTU values, the minimum value returned from the script is the actual maximum MTU you can support without fragmentation. Use that minimum value as your new host interface MTU setting.
If you’re operating in a cloud environment you may need to repeat this exercise from time to time as switches are changed and upgraded at your cloud provider.
For a long time rebooting a host with Ansible has been tricky. The steps are:
ssh to the host
Reboot the host
Disconnect before the host closes your ssh connection
Wait some number of seconds to ensure the host has really shut down
Attempt to ssh to the host and execute a command
Repeat ssh attempt until it works or you give up
Seems clear enough, but if you Google for an answer you may end up at this StackExchange page, which gives lots of not-quite-correct answers from 2015 (and one correct answer). Some people suggest checking port 22, but just because sshd is listening doesn’t mean that it’s in a state where it’s accepting connections.
The correct answer is to use Ansible version 2.7 or greater. 2.7 introduced the reboot module, and now all you have to do is add this to your list of handlers:
- name: Reboot host and wait for it to restart
  reboot:
    msg: "Reboot initiated by Ansible"
    connect_timeout: 5
    reboot_timeout: 600
    pre_reboot_delay: 0
    post_reboot_delay: 30
    test_command: whoami
This handler will:
Reboot the host
Wait 30 seconds
Attempt to connect via ssh and run whoami
Disconnect after 5 seconds if ssh isn’t working
Keep attempting to connect for 10 minutes (600 seconds)
Add the directive:
notify: Reboot host and wait for it to restart
… to any Ansible task that requires a reboot after a change. The host will be rebooted when the play finishes, then Ansible will wait until the host is back up and ssh is working before continuing on to the next play.
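For example, a task like this one triggers the handler (the grub change is a hypothetical stand-in for any change that requires a reboot):

- name: Update kernel boot parameters (hypothetical example)
  lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_CMDLINE_LINUX='
    line: 'GRUB_CMDLINE_LINUX="transparent_hugepage=never"'
  notify: Reboot host and wait for it to restart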
If you need to reboot halfway through a playbook you can force all notified handlers to execute with a meta: flush_handlers task:

- name: Reboot if necessary
  meta: flush_handlers
I sometimes do that to change something, force a reboot, then verify that the change worked, all within the same playbook.
The AWS Elastic File System (EFS) gives you an NFSv4-mountable file system with almost unlimited storage capacity. The file system I just created to write this article reports 9,007,199,254,739,968 bytes free. In human-readable format, df -kh reports 8.0E (exabytes) of available disk space. In the year 2019, that’s a lot of storage space.
In past articles I’ve shown how to create EFS resources manually, but this week I wanted to programmatically create EFS resources with Terraform so that I could easily create, test, and tear-down EFS and VM resources on AWS.
I also wanted to make sure that my EFS resources were secure: only VMs within my Virtual Private Cloud (VPC) should be able to mount or otherwise access the EFS data, and no one outside of my VPC should be able to reach it at all.
Creating an EFS resource is easy. The Terraform code looks like this:
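Something along these lines, where the resource names match the references below and the creation token and performance settings are assumed defaults:

resource "aws_efs_file_system" "efs-example" {
  creation_token   = "efs-example"      # assumed token
  performance_mode = "generalPurpose"   # assumed; this is the default
  encrypted        = true
  tags = {
    Name = "efs-example"
  }
}

resource "aws_efs_mount_target" "efs-mt-example" {
  file_system_id  = aws_efs_file_system.efs-example.id
  subnet_id       = aws_subnet.subnet-efs.id
  security_groups = [aws_security_group.ingress-efs-test.id]
}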
The file_system_id is automatically set to the efs-example resource’s ID, which ties the mount target to the EFS file system.
The subnet_id for subnet-efs is a separate /24 subnet I created from my VPC just for EFS. The ingress-efs-test security group is a separate security group I created for EFS. Let’s cover each of these separately.
A separate EFS subnet
First off I’ve allocated a /16 subnet for my VPC and I carve out individual /24 subnets from that VPC for each cluster of VMs and/or EFS resources that I add to an AWS availability zone. Here’s how I’ve defined my test environment VPC and EFS subnet:
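A sketch of those two resources: the 10.0.0.0/16 CIDR, the VPC name test-env, and the availability zone are assumptions on my part, but cidrsubnet() shows the /24-out-of-/16 carving described above:

resource "aws_vpc" "test-env" {
  cidr_block           = "10.0.0.0/16"   # assumed VPC CIDR
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = {
    Name = "test-env"
  }
}

resource "aws_subnet" "subnet-efs" {
  # Add 8 bits to the /16 to get a /24; the final argument selects which /24
  cidr_block        = cidrsubnet(aws_vpc.test-env.cidr_block, 8, 8)
  vpc_id            = aws_vpc.test-env.id
  availability_zone = "us-west-2a"        # assumed AZ
}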
A separate EFS security group

Finally, I need a security group that only allows traffic between my test environment VMs and my test environment EFS volume. I already have a security group called ingress-test-env that controls security for my VMs. For EFS I create another security group that allows inbound traffic on port 2049 (the NFSv4 port) and allows egress traffic on any port.
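Sketched out in Terraform (the exact layout is an assumption; the names, port, and security group references match the description above):

resource "aws_security_group" "ingress-efs-test" {
  name   = "ingress-efs-test"
  vpc_id = aws_vpc.test-env.id

  # NFSv4 traffic in, only from VMs in the ingress-test-env security group
  ingress {
    security_groups = [aws_security_group.ingress-test-env.id]
    from_port       = 2049
    to_port         = 2049
    protocol        = "tcp"
  }

  # Egress on any port, again limited to the test environment VMs
  egress {
    security_groups = [aws_security_group.ingress-test-env.id]
    from_port       = 0
    to_port         = 0
    protocol        = "-1"
  }
}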
By setting the ingress-efs-test resource’s security_groups attribute to ingress-test-env, only VMs in the ingress-test-env security group can talk to the EFS volume. If you use security_groups like this, you really lock down the EFS volume, and you don’t need to set the cidr_blocks attribute at all.
Ceph is a distributed storage system that provides object, file, and block storage. On each storage node you’ll find a file system where Ceph stores objects and a Ceph OSD (object storage daemon) process. On a Ceph cluster you’ll also find Ceph MON (monitor) daemons, which ensure that the Ceph cluster remains highly available.
Rook acts as a Kubernetes orchestration layer for Ceph, deploying the OSD and MON processes as POD replica sets. From the Rook README file:
Rook turns storage software into self-managing, self-scaling, and self-healing storage services. It does this by automating deployment, bootstrapping, configuration, provisioning, scaling, upgrading, migration, disaster recovery, monitoring, and resource management. Rook uses the facilities provided by the underlying cloud-native container management, scheduling and orchestration platform to perform its duties.
When I created the cluster I built VMs with 40GB hard drives, so with 5 Kubernetes nodes that gives me ~200GB of storage on my cluster, most of which I’ll use for Ceph.
Installing Rook+Ceph is pretty straightforward. On my personal cluster I installed Rook+Ceph v0.9.0 by following these steps:
git clone git@github.com:rook/rook.git
cd rook
git checkout v0.9.0
cd cluster/examples/kubernetes/ceph
kubectl create -f operator.yaml
kubectl create -f cluster.yaml
Rook deploys the PODs in two namespaces, rook-ceph-system and rook-ceph. On my cluster it took about 2 minutes for the PODs to deploy, initialize, and get to a running state. While I was waiting for everything to finish I checked the POD status with:
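For example, repeating these until every pod reported Running (the namespaces are the two listed above):

kubectl get pods -n rook-ceph-system
kubectl get pods -n rook-ceph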
Now I need to do two more things before I can install Prometheus and Grafana:
I need to make Rook the default storage provider for my cluster.
Since the Prometheus Helm chart requests volumes formatted with the XFS filesystem, I need to install XFS tools on all of my Ubuntu Kubernetes nodes. (XFS tools are not yet installed by Kubespray by default, although there’s currently a PR up that addresses that issue.)
Make Rook the default storage provider
To make Rook the default storage provider I just run a kubectl command:
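Most likely something like this; storageclass.kubernetes.io/is-default-class is the standard Kubernetes annotation for marking a default storage class, and rook-ceph-block is the class created by the stock Rook examples:

kubectl patch storageclass rook-ceph-block -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'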
That updates the rook-ceph-block storage class and makes it the default for storage on the cluster. Any applications that I install will use Rook+Ceph for their data storage if they don’t request a specific storage class.
Install XFS tools
Normally I would not recommend running one-off commands on a cluster. If you want to make a change to a cluster, you should encode the change in a playbook so it’s applied every time you update the cluster or add a new node. That’s why I submitted a PR to Kubespray to address this problem.
However, since my Kubespray PR has not yet merged, and I built the cluster using Kubespray, and Kubespray uses Ansible, one of the easiest ways to install XFS tools on all hosts is by using the Ansible “run a single command on all hosts” feature:
cd kubespray
export ANSIBLE_REMOTE_USER=ansible
ansible kube-node -i inventory/mycluster/hosts.ini \
    --become --become-user root \
    -a 'apt-get install -y xfsprogs'
Deploy Prometheus and Grafana
Now that XFS is installed I can successfully deploy Prometheus and Grafana using Helm:
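At the time this was written that meant Helm 2 syntax, so the commands look roughly like this (the release names are my own choice; the charts come from the old stable repository):

helm install --name prometheus stable/prometheus
helm install --name grafana stable/grafana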