I’m using Ansible to set up the network interface cards of multiple racks of storage servers running Centos 6.6. Each server has four network interfaces to configure, a public 1GbE interface, a private 1GbE interface, and two 10GbE interfaces that are set up as a bonded 20GbE interface with two VLANs assigned to the bond.
If Ansible changes an interface on a server it calls a handler to restart the network interfaces so the changes go into effect. However, I don’t want the network interfaces of every single server in a cluster to restart at the same time, so at the beginning of my network.yml
playbook I set:
serial: 1
That way Ansible just updates the network config of one server at a time.
Also, if there are any failures I want Ansible to stop immediately, so if I screwed something up I don’t take out the networking to every computer in the cluster. For this reason I also set:
max_fail_percentage: 1
If a change is made to an interface I’ve been using the following handler to restart the interface:
- name: Restart Network service: name=network state=restarted
That works, but about half the time Ansible detects a failure and drops out with an error, even though the network restarted just fine. Checking the server immediately after Ansible says that there’s an error shows that the server is running and it’s network interfaces were configured correctly.
This behavior is annoying since you have to restart the entire playbook after one server fails. If you’re configuring many racks of servers and the network setup is just updating one server at a time I’d end up having to restart the playbook a half dozen times to get through it, even though nothing was actually wrong.
At first I thought that maybe the ssh connection was dropping (I was restarting the network after all) but you can log in via ssh and restart the network and never lose the connection, so that wasn’t the problem.
The connection does pause as the interface that you’re ssh-ing in over resets, but the connection comes right back.
I wrote a short script to repeatedly restart the network interfaces and check the exit code returned, but the exit code was always 0, “no errors”, so network restart wasn’t reporting an error, but for some reason Ansible thought there was a failure.
There’s obviously some sort of timing issue causing a problem, where Ansible is checking to see if all is well, but since the network is being reset the check times out.
I initially came up with this workaround:
- name: Restart Network shell: service network restart; sleep 3
That fixes the problem, however, since “sleep 3” will always exit with a 0 exit code (success), Ansible will always think this worked even when the network restart failed. (Ansible takes the last exit code returned as the success/failure of the entire shell operation.) If “service network restart” actually does fail, I want Ansible to stop processing.
In order to preserve the exit code, I wrote a one-line Perl script that restarts the network, sleeps 3 seconds, then exits with the same exit code returned by “service network restart”.
- name: Restart Network # Restart the network, sleep 3 seconds, return the # exit code returned by "service network restart". # This is to work-around a glitch in Ansible where # it detects a successful network restart as a failure. command: perl -e 'my $exit_code = system("service network restart"); sleep 3; $exit_code = $exit_code >> 8; exit($exit_code);'
Now Ansible grinds through the network configurations of all of the hosts in my racks without stopping.
Hope you find this useful.