Ansible Role for RHEL Atomic Host

This morning I was asked by a friend if I could share any Ansible roles we use at $WORK for our Red Hat Atomic Host servers.  It was a relatively easy task to review and sanitize our configs – Atomic Hosts are so minimal, there’s almost nothing we have to do to configure them.

When our Atomic hosts are initially created, they’re minimally configured via cloud-init to set up networking and add a root user ssh key.  (We have a VMware environment, so we use the RHEL Atomic .ova provided by Red Hat, and mount an ISO with the cloud-init ‘user-data’ and ‘meta-data’ files to be read by cloud-init.)  Once that’s done, we run Ansible tasks from a central server to set up the rest of the Atomic host.
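For anyone who hasn't built one of these seed ISOs before, the step might look roughly like the sketch below. It assumes cloud-init's NoCloud conventions (files named user-data and meta-data on a volume labelled cidata) and the genisoimage tool. The directory, hostname, and key are made-up placeholders, not our real values, and which account receives the key depends on the image's cloud-init defaults.

```shell
#!/bin/sh
# Sketch: build a cloud-init NoCloud seed ISO. All values are placeholders.
set -eu

SEED_DIR="${SEED_DIR:-./seed}"        # hypothetical working directory
HOST_NAME="${HOST_NAME:-atomic01}"    # hypothetical hostname

mkdir -p "$SEED_DIR"

# meta-data: identifies the instance to cloud-init
cat > "$SEED_DIR/meta-data" <<EOF
instance-id: $HOST_NAME
local-hostname: $HOST_NAME
EOF

# user-data: must begin with the #cloud-config header
cat > "$SEED_DIR/user-data" <<EOF
#cloud-config
ssh_authorized_keys:
  - ssh-rsa AAAA...placeholder admin@example
EOF

# NoCloud looks for a volume labelled "cidata"
if command -v genisoimage >/dev/null 2>&1; then
    genisoimage -output "$SEED_DIR/seed.iso" -volid cidata -joliet -rock \
        "$SEED_DIR/user-data" "$SEED_DIR/meta-data"
else
    echo "genisoimage not installed; wrote the two files only" >&2
fi
```

The resulting seed.iso is what would get attached to the VM as a CD-ROM before first boot.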

Below is a snippet of most of the playbook.

I think the variables are self-explanatory.   Some notes are added to explain why we’re doing a particular thing.  The disk partitioning is explained in more detail in a previous post of mine.

  # Set w/Ansible because cloud-init is plain text
  - name: Access | set root password
    user:
      name: root
      password: "{{ root_password }}"

  - name: Access | add ssh user keys
    authorized_key:
      user: "{{ }}"
      key: "{{ item.key }}"
    with_items: "{{ ssh_users }}"

  - name: Access | root access to cron
    lineinfile:
      dest: /etc/security/access.conf
      line: "+:root:cron crond"

  - name: Access | fail closed
    lineinfile:
      dest: /etc/security/access.conf
      line: "-:ALL:ALL"

  # docker-storage-setup service re-configures LVM
  # EVERY TIME Docker service starts, eventually
  # filling up disk with millions of tiny files
  - name: Disks | disable lvm archives
    copy:
      src: lvm.conf
      dest: /etc/lvm/lvm.conf
    notify:
      - restart lvm2-lvmetad

  - name: Disks | expand vg with extra disks
    lvg:
      vg: '{{ volume_group }}'
      pvs: '{{ default_pvs }}'

  - name: Disks | expand the lvm
    lvol:
      vg: '{{ volume_group }}'
      lv: '{{ root_lv }}'
      size: 15g

  - name: Disks | grow fs for root
    filesystem:
      fstype: xfs
      dev: '{{ root_device }}'
      resizefs: yes

  - name: Disks | create srv lvm
    lvol:
      vg: '{{ volume_group }}'
      lv: '{{ srv_lv }}'
      size: 15g

  - name: Disks | format fs for srv
    filesystem:
      fstype: xfs
      dev: '{{ srv_device }}'
      resizefs: no

  - name: Disks | mount srv
    mount:
      name: '{{ srv_partition }}'
      src: '{{ srv_device }}'
      fstype: xfs
      state: mounted
      opts: 'defaults'

  ## This is a workaround for XFS bug (only grows if mounted)
  - name: Disks | grow fs for srv
    filesystem:
      fstype: xfs
      dev: '{{ srv_device }}'
      resizefs: yes

  ## Always check this, or it will try to do it each time
  - name: Disks | check if swap exists
    stat:
      path: '{{ swapfile }}'
      get_checksum: no
      get_md5: no
    register: swap

  - debug: var=swap.stat.exists

  - name: Disks | create swap lvm
    ## Shrink not supported until 2.2
    #lvol: vg=atomicos lv=swap size=2g shrink=no
    lvol:
      vg: atomicos
      lv: swap
      size: 2g

  - name: Disks | make swap file
    command: mkswap '{{ swapfile }}'
    when:
      - swap.stat.exists == false

  - name: Disks | add swap to fstab
    lineinfile:
      dest: /etc/fstab
      regexp: "^{{ swapfile }}"
      line: "{{ swapfile }}  none    swap    sw    0   0"

  - name: Disks | swapon
    command: swapon '{{ swapfile }}'
    when: ansible_swaptotal_mb < 1

  - name: Docker | setup docker-storage-setup
    lineinfile:
      dest: /etc/sysconfig/docker-storage-setup
      regexp: ^ROOT_SIZE=
      line: "ROOT_SIZE=15G"
    register: docker_storage_setup

  - name: Docker | setup docker-network
    lineinfile:
      dest: /etc/sysconfig/docker-network
      line: >
        'DOCKER_NETWORK_OPTIONS=-H unix:///var/run/docker.sock
        -H tcp://

  - name: add CA certificate
    copy:
      src: ca.crt
      dest: /etc/pki/tls/certs/ca.crt
      owner: root
      group: root
      mode: 0644

  - name: Admin Helpers | thinpool wiper script
      dest: /usr/local/bin/
      mode: 0755

  - name: Journalctl | set journal sizes
    copy:
      src: journald.conf
      dest: /etc/systemd/journald.conf
      mode: 0644
    notify:
      - restart systemd-journald

  - name: Random Atomic Bugfixes | add lastlog
    file:
      path: /var/log/lastlog
      state: touch

  - name: Random Atomic Bugfixes | add root bashrc for prompt
    copy:
      src: root-bashrc
      dest: /root/.bashrc
      mode: 0644

  - name: Random Atomic Bugfixes | add root bash_profile for .bashrc
    copy:
      src: root-bash_profile
      dest: /root/.bash_profile
      mode: 0644

  ### Disable Cloud Init ###
  ## These are in Ansible 2.2, which we don't have yet
  - name: stop cloud-config
    systemd: name=cloud-config state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-init
    systemd: name=cloud-init state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-init-local
    systemd: name=cloud-init-local state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-final
    systemd: name=cloud-final state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: Remove old cloud-init files if they exist
    shell: rm -f /etc/init/cloud-*
    ignore_errors: yes
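
If the systemd module isn't available (it arrived in Ansible 2.2), roughly the same result can be had with plain systemctl calls, for example from a shell or command task. This is a minimal sketch rather than the tasks we actually run; the DRY_RUN switch and the run helper are illustrative additions so the commands can be previewed without touching the host.

```shell
#!/bin/sh
# Sketch: stop, disable, and mask the cloud-init units with systemctl.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (the default here)

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

disable_cloud_init() {
    for unit in cloud-config cloud-init cloud-init-local cloud-final; do
        # "|| true" plays the role of ignore_errors: a unit may be
        # missing or already stopped
        run systemctl stop "$unit" || true
        run systemctl disable "$unit" || true
        run systemctl mask "$unit" || true
    done
}

disable_cloud_init
```

Running it with DRY_RUN=0 as root would actually apply the changes.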

The only other tasks we run are related to $WORK-specific stuff (security office scanning, a patching user account for automated updates, etc.).

One of the beneficial side-effects of mixing cloud-init and Ansible is that the cloud-init is only used for the initial setup (networking and root access), so it ends up being under the size limit imposed by Amazon Web Services on their user-data files.  This allows us to create and maintain RHEL Atomic hosts in AWS using the exact same cloud-init user-data file and Ansible roles.
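
As a quick sanity check, something like the following could confirm a user-data file fits before it is handed to EC2. It is only a sketch: the 16 KB figure is the raw (pre-base64) limit AWS documents for user data, and the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: verify a cloud-init user-data file fits the EC2 user-data limit.
set -eu

LIMIT_BYTES=16384   # 16 KB, measured before base64 encoding

check_user_data_size() {
    file="$1"
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -le "$LIMIT_BYTES" ]; then
        echo "ok: $file is $size bytes (limit $LIMIT_BYTES)"
    else
        echo "too large: $file is $size bytes (limit $LIMIT_BYTES)" >&2
        return 1
    fi
}

# Usage: check_user_data_size user-data
```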

Using Docker and AWS to Survive an Outage

Last week at $WORK, we suffered from an outage that slowed down a large part of our network and took down our main website for both internal and external customers.  We were under a distributed denial of service attack focused on the website itself.  The site is load-balanced, and this resulted in slowdowns or outages for all the services behind the load balancers, as well.

While folks were bouncing ideas around on how to bring the site up again while still struggling with the outage, I mentioned that I could pretty quickly migrate the site over to Amazon Web Services and run it in Docker containers there. The higher-ups gave me the go-ahead and a credit card (very important, heh) and told me to get it set up.  The idea was to have it there so we could fail over to the cloud if we were unable to resolve the outage in a reasonable time.

TL;DR – I did, it was easy, and we failed over all external traffic to the cloud. Details below.

Amazon Web Services

Despite having a credit card and a pretty high blanket “OK”, I wanted to make sure we didn’t spend any money unless it was absolutely necessary. To that end, I created three of the “free tier” EC2 instances (1GB RAM, 1 CPU, 10GB Storage) rather than one or more larger instances. After all, these servers were going to be doing one thing and one thing only – running Docker. I took all the defaults, except two. First, I opted to use RHEL7 as the OS. We use Red Hat at work, so I’m familiar with it (and let’s be honest, it works really well), especially where setting up Docker comes in. Second, I set up a security group that allowed only HTTP/HTTPS traffic to the EC2 instances, and SSH access only from $WORK. Security groups are like a logical firewall, I guess – run by Amazon in front of the servers themselves.

The EC2 instances started almost immediately, and I logged in via SSH using the key pair I created for this project. The first thing I did was augment the security group by setting the IPTables firewall on the hosts themselves to match: SSH from $WORK only, drop everything else, even pings.  You know, just in case.

Note: Since I was planning to use Docker to run the website, I didn’t need to add IPTables rules for HTTP/HTTPS. Docker uses the FORWARD chain, since it NATs from the host IP to the containers, and Docker has the ability to add and remove rules from the chain itself as needed.
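
For illustration, a host firewall along those lines could be expressed as an iptables-restore ruleset like the one this sketch prints. It is not a copy of the rules I actually loaded: the WORK_CIDR range is a documentation placeholder, and the loopback and established-connection rules are my assumption of what a working host needs so that things like yum can still get replies.

```shell
#!/bin/sh
# Sketch: print an iptables-restore ruleset that allows SSH from one
# network and drops all other inbound traffic, pings included.
set -eu

WORK_CIDR="${WORK_CIDR:-203.0.113.0/24}"   # placeholder for the $WORK network

print_rules() {
    cat <<EOF
*filter
:INPUT DROP [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -s $WORK_CIDR --dport 22 -j ACCEPT
COMMIT
EOF
}

print_rules
```

Piping the output into iptables-restore as root replaces the whole filter table, including anything Docker has already added to FORWARD, so it would make sense to load it before Docker starts or to restart Docker afterwards.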

Next, I ran a quick *yum update* to get the latest patches on the EC2 instance. It wasn’t terribly out of date, so this was quick.

Now to the meat of things. I didn’t really want to muck about with repos or try to find which one was required to install the Docker RPM, so I just copied the RPM for Docker from our local repository. The RPM is packaged upstream by Red Hat, and includes Docker 1.2.1. Even though I wanted to use Docker 1.4.1, the older RPM version is no big deal – I just installed it to get the basic config files – systemd service files, sysconfig, etc. Once the RPM was installed, I downloaded the Docker 1.4.1 binary from, and replaced the 1.2.1 binary from the RPM. Presto! Latest Docker with the handy *docker exec* command! At this point, the server itself was basically done, and I moved on to setting up the Docker image.

Time spent so far: About 5 minutes


Now, I didn’t have an image for our website ready to go or anything – I was going to have to build it from scratch.  However, I’ve been lucky enough to be allowed to play around with Docker at $WORK, and had already done some generic Images for web stacks for our public DockerDemos project (, so I was familiar with what I’d need to build the image for our site. I wrote a Dockerfile and built the image on my local laptop to test it. I went through a few revisions to get it perfect, but it only took about 15 minutes to write it from scratch. Once that was ready, I copied the Dockerfile and supporting files up to the EC2 servers, and built the images there. With the magic that is Docker and Linux containers, everything functioned exactly as it did on my laptop, and in a few seconds all three EC2 instances had the website image ready to go.

The final step was to run the container from the image. On all three of the EC2 instances, I ran:

docker run --name website -p 80:80 -p 443:443 -d website && \
docker logs -f website

The first command immediately started up the web servers inside the containers and started to sync their content, and the second opened up STDOUT inside the container so I could watch the progress. In a minute or two the sync was done, and the servers were online!

Note: The “sync” I’m talking about is part of how our website works, not something related to Docker itself.

Total time spent: About 25 minutes

So, in one fell swoop – about a half hour – I was able to create three servers running a Docker image to serve our main website, from scratch. It’s a good thing, too. It wasn’t long before we made the call to fail over, and currently all of our external traffic to the site is being served by these three containers.

That seems cool, no?  But check this out:

Sunday night, I needed to add more servers to the rotation. It was late. I was cranky to have been called after hours. I logged into AWS and used the EC2 “Create Image” feature to commit one of the running instances to a custom image (took about a minute). Then, I spun up three more EC2 instances from that image. They started up as quickly as a normal EC2 instance, and contained all the work I’d already done to set up the first servers, including the Docker package, binary, and image. Once they were up, all I had to do was run the *docker run* command again, and they were ready to go. Elapsed time?

2 minutes

It took longer for the 5 minute time-to-live on our DNS entry to expire.
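
I did all of that in the web console, but the same steps could be scripted with the AWS CLI. The sketch below is an assumption of how that might look, not something I ran that night: the instance ID, image name, and instance type are placeholders, and by default it only prints the commands.

```shell
#!/bin/sh
# Sketch: commit a running instance to an AMI and launch copies of it.
set -eu

DRY_RUN="${DRY_RUN:-1}"                                     # 1 = print only
SOURCE_INSTANCE="${SOURCE_INSTANCE:-i-0123456789abcdef0}"   # placeholder ID
IMAGE_NAME="${IMAGE_NAME:-website-docker-host}"             # placeholder name
INSTANCE_TYPE="${INSTANCE_TYPE:-t2.micro}"                  # assumed type
COUNT="${COUNT:-3}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

clone_instances() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ aws ec2 create-image --instance-id $SOURCE_INSTANCE --name $IMAGE_NAME"
        ami_id="ami-PLACEHOLDER"
    else
        # Note: create-image reboots the source instance unless --no-reboot is given
        ami_id=$(aws ec2 create-image --instance-id "$SOURCE_INSTANCE" \
            --name "$IMAGE_NAME" --query ImageId --output text)
    fi
    # Block until the new image is usable, then launch the copies
    run aws ec2 wait image-available --image-ids "$ami_id"
    run aws ec2 run-instances --image-id "$ami_id" \
        --count "$COUNT" --instance-type "$INSTANCE_TYPE"
}

clone_instances
```

A real run would also want the key pair and security group passed to run-instances; they are left out here to keep the sketch short.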

Docker is Awesome. With AWS, it’s Awesome-er. I’m trying to convince folks that we should leave all of our external traffic to be served by Docker in AWS, and to migrate more sites out there. At the very least, it’s extremely flexible and allows us to respond to issues on a whole different timescale than we could before.

Oh, and an added bonus? All of our external monitoring (in multiple sites across the country) reports that our page load speeds have improved 3x compared to what they were on the servers hosted in-house with regular non-Docker setups. I’m investigating what is giving us that increase this week.

Oh, and a second added bonus? For the last five days, our bill from Amazon for hosting our main website is a whopping *$4.69*. That’s a cup of crappy venti mochachino soy caramel crumble arabian dark coffee (or whatever) at the local coffee chain. And I can do without the calories.


Well, it’s been six months since this little adventure took place.  Since then, this solution has worked so well that we left all of our external traffic pointing to these instances at AWS.  Arguably, things have gotten even easier the more I work with both Docker and AWS.  To that point:

  1. The Docker approach worked so well that we replaced all of the servers hosting our website internally with basic RHEL7 servers running Docker containers.  The servers are considerably more lightweight than they used to be, and as such we can get better performance out of them.
  2. I’ve since added the new(-ish) Docker flag --restart=always to the deploy command.  This saves me the step of even having to start the containers on reboot.
  3. I set up all the hosts to use the Docker API, and TLS authentication, so I can upload new images and start and stop containers on each host without even needing to log in to them.  (This required the opening of port 2376 to $WORK in the security group and host firewall, fyi.)
  4. I wrote a couple of simple bash scripts to re-build the image as needed, and deploy locally for testing.  With the portable nature of Docker images, it’s extremely easy for me to test all changes before I push them out.
  5. Rotating the instances in and out of production at AWS is extremely simple with Amazon’s Elastic IP Addresses.  We are able to rotate a host out of service and instantly replace it with another, allowing us to patch them all with zero downtime.
  6. Amazon’s API is a wonderful thing.  I can manage the entire thing with some Python scripts on my laptop, or the convenient Amazon CLI package.
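
To make a few of those points concrete, here is a rough sketch of what a remote deploy and an Elastic IP swap could look like from a laptop. It is not my actual deploy script: the function names, hostnames, certificate directory, and the use of docker load to ship the image are all assumptions, and by default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch: drive remote Docker daemons over TLS and move an Elastic IP.
set -eu

DRY_RUN="${DRY_RUN:-1}"                          # 1 = print only
CERT_DIR="${CERT_DIR:-${HOME:-.}/.docker}"       # assumed home of ca.pem, cert.pem, key.pem

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run any docker client command against a remote daemon on port 2376
remote_docker() {
    host="$1"; shift
    run docker --tlsverify \
        --tlscacert="$CERT_DIR/ca.pem" \
        --tlscert="$CERT_DIR/cert.pem" \
        --tlskey="$CERT_DIR/key.pem" \
        -H "tcp://$host:2376" "$@"
}

# Upload an image tarball, then replace the running container
deploy() {
    host="$1"
    image_tar="$2"
    remote_docker "$host" load -i "$image_tar"
    remote_docker "$host" stop website || true   # may not be running yet
    remote_docker "$host" rm website || true     # may not exist yet
    remote_docker "$host" run --restart=always --name website \
        -p 80:80 -p 443:443 -d website
}

# Point an Elastic IP (by allocation ID) at a different instance
move_elastic_ip() {
    run aws ec2 associate-address --allocation-id "$1" --instance-id "$2"
}

# Usage: deploy web1.example.com website.tar
#        move_elastic_ip eipalloc-0abc i-0abc
```

With --restart=always on the run command, the container comes back by itself after a host reboot, so patching a host amounts to moving its Elastic IP elsewhere, rebooting, and moving it back.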

Docker and AWS have proven themselves to me and, through this process, to the higher-ups at $WORK.  We’re embracing Docker whole-heartedly in our datacenters here, and we’ve moved a number of services to AWS now, as well.  The ease and flexibility of both is a boon to us, and to our clients, and it’s starting to transform the way we do things in IT – the way we do everything in IT.