Restarting network interfaces in Ansible

I’m using Ansible to set up the network interface cards of multiple racks of storage servers running CentOS 6.6. Each server has four network interfaces to configure: a public 1GbE interface, a private 1GbE interface, and two 10GbE interfaces that are set up as a bonded 20GbE interface with two VLANs assigned to the bond.

If Ansible changes an interface on a server it calls a handler to restart the network interfaces so the changes go into effect. However, I don’t want the network interfaces of every single server in a cluster to restart at the same time, so at the beginning of my network.yml playbook I set:

serial: 1

That way Ansible just updates the network config of one server at a time.

Also, if there are any failures I want Ansible to stop immediately, so if I screwed something up I don’t take out the networking to every computer in the cluster. For this reason I also set:

max_fail_percentage: 1

If a change is made to an interface I’ve been using the following handler to restart the interface:

- name: Restart Network
  service: name=network state=restarted

That works, but about half the time Ansible detects a failure and drops out with an error, even though the network restarted just fine. Checking the server immediately after Ansible reports the error shows that the server is running and its network interfaces are configured correctly.

This behavior is annoying since I have to restart the entire playbook after one server fails. When configuring many racks of servers, with the network setup updating just one server at a time, I’d end up restarting the playbook a half dozen times to get through it, even though nothing was actually wrong.

At first I thought that maybe the ssh connection was dropping (I was restarting the network, after all), but you can log in via ssh, restart the network, and never lose the connection, so that wasn’t the problem.

The connection does pause as the interface that you’re ssh-ing in over resets, but the connection comes right back.

I wrote a short script to repeatedly restart the network interfaces and check the exit code returned. The exit code was always 0 (“no errors”), so the network restart wasn’t reporting an error, yet for some reason Ansible thought there was a failure.
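A rough sketch of that kind of test is below. It is not the original script, and the function name count_failures is made up for illustration. It runs whatever command it is given a fixed number of times and prints how many runs returned a non-zero exit code.

```shell
#!/bin/sh
# Hypothetical helper, not the script from the post.
# Usage: count_failures <runs> <command> [args...]
# Runs the command <runs> times and prints the number of non-zero exits.
count_failures() {
    runs=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Discard the command's output; only its exit code matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example, run as root on the server under test:
# count_failures 20 service network restart
```

If this prints 0 while Ansible still reports failures for the same restart, the restart command itself is not what is failing.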

This looks like a timing issue: Ansible checks to see if all is well, but since the network is being reset, the check times out.

I initially came up with this workaround:

- name: Restart Network
  shell: service network restart; sleep 3

That fixes the problem. However, since “sleep 3” always exits with a 0 exit code (success), Ansible will always think this worked, even when the network restart failed. (Ansible takes the last exit code returned as the success/failure of the entire shell operation.) If “service network restart” actually does fail, I want Ansible to stop processing.
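The masking effect is easy to reproduce outside Ansible. In this small sketch, “false” stands in for a failing “service network restart”, “sleep 0” stands in for the pause, and status_of is a made-up helper that runs a snippet in a child shell and prints its exit status.

```shell
#!/bin/sh
# Hypothetical helper: run a snippet in a child shell, print its exit status.
status_of() {
    rc=0
    sh -c "$1" || rc=$?
    echo "$rc"
}

# The list's status is that of its last command, so the failure is hidden.
status_of 'false; sleep 0'

# Saving the status before the pause and exiting with it keeps the failure.
status_of 'false; rc=$?; sleep 0; exit $rc'
```

The first call prints 0 and the second prints 1, which is the difference between Ansible carrying on and Ansible stopping.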

In order to preserve the exit code, I wrote a one-line Perl script that restarts the network, sleeps 3 seconds, then exits with the same exit code returned by “service network restart”.

- name: Restart Network
  # Restart the network, sleep 3 seconds, return the
  # exit code returned by "service network restart".
  # This is to work around a glitch in Ansible where
  # it detects a successful network restart as a failure.
  command: perl -e 'my $exit_code = system("service network restart"); sleep 3; $exit_code = $exit_code >> 8; exit($exit_code);'

Now Ansible grinds through the network configurations of all of the hosts in my racks without stopping.

Hope you find this useful.

6 thoughts on “Restarting network interfaces in Ansible”

  1. Hello – what version of ansible are you using? There’s a bug in 1.9 that’s causing this sort of behaviour from the service module, and it sounds like that’s biting you right now. If you’re able to test with the devel branch that might help (it’s fixed in devel) – or alternatively maybe for a short while run with command: /sbin/service instead?

    • I have been using v1.8, which is the packaged version in CentOS 6.6, but I also have a bleeding edge copy from GitHub on my laptop. I’ll give bleeding edge a shot, but on production systems I have to use the packaged version.

  2. Hi, why use heavy stuff like Perl and not Bash directly?

    service network restart; rc=$?; sleep 3; exit $rc

    • You mean that there’s more than one way to do it? Who would have thought?

      Seriously though, I used to write all of my simple scripts in Bash, more complex scripts in Perl or Python or Ruby, and C or C++ or Erlang if I needed performance.

      Many times my simple Bash script would grow over time, often to the point where it was problematic to maintain or extend, and I’d end up converting it to Perl.

      After doing this a few dozen times over the course of my career I started defaulting to Just Use Perl In The First Place. It saves me development time in the long run, and in a majority of cases my time is worth more than the extra CPU cycles. In cases where the extra CPU cycles are worth it, I’m writing code in C or C++ or Erlang, not Bash.

  3. The version I used, Ansible 2.0.0.2, still needs the “sleep 3” to avoid hanging indefinitely. But the other, lengthier fix doesn’t work. So, for me, the following tiny tweak works:

    - name: Restart Network
      shell: service network restart; sleep 3
