Ansible Role for RHEL Atomic Host

This morning I was asked by a friend if I could share any Ansible roles we use at $WORK for our Red Hat Atomic Host servers.  It was a relatively easy task to review and sanitize our configs – Atomic Hosts are so minimal, there’s almost nothing we have to do to configure them.

When our Atomic hosts are initially created, they’re minimally configured via cloud-init to set up networking and add a root user ssh key.  (We have a VMware environment, so we use the RHEL Atomic .ova provided by Red Hat, and mount an ISO with the cloud-init ‘user-data’ and ‘meta-data’ files to be read by cloud-init.)  Once that’s done, we run Ansible tasks from a central server to set up the rest of the Atomic host.
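
For context, the ‘user-data’ side of that is tiny.  A minimal sketch (the hostname and key below are placeholders, and our networking bits are handled separately):

#cloud-config
# placeholder values -- the real key is injected when the ISO is built
hostname: atomic01.example.com
disable_root: false
users:
  - name: root
    ssh_authorized_keys:
      - ssh-rsa AAAAB3Nza...EXAMPLE ansible@central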

Below is a snippet of most of the playbook.

I think the variables are self-explanatory.   Some notes are added to explain why we’re doing a particular thing.  The disk partitioning is explained in more detail in a previous post of mine.
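
For reference, the variables it uses might be defined something like this in the role defaults (illustrative values only, not our real config):

---
root_password: "{{ vault_root_password }}"    # a hashed password, kept in a vault
ssh_users:
  - name: admin1
    key: "ssh-rsa AAAAB3Nza...EXAMPLE admin1@work"
volume_group: atomicos
default_pvs: /dev/sda2,/dev/sdb               # every PV in the group, original disk included
root_lv: root
root_device: /dev/mapper/atomicos-root
srv_lv: srv
srv_device: /dev/mapper/atomicos-srv
srv_partition: /srv
swapfile: /dev/atomicos/swap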

---
  # Set w/Ansible because cloud-init is plain text
  - name: Access | set root password
    user: 
      name: root
      password: "{{ root_password }}"

  - name: Access | add ssh user keys
    authorized_key:
      user: "{{ item.name }}"
      key: "{{ item.key }}"
    with_items: "{{ ssh_users }}"

  - name: Access | root access to cron
    lineinfile:
      dest: /etc/security/access.conf
      line: "+:root:cron crond"

  - name: Access | fail closed
    lineinfile:
      dest: /etc/security/access.conf
      line: "-:ALL:ALL"

  # docker-storage-setup service re-configures LVM
  # EVERY TIME Docker service starts, eventually
  # filling up disk with millions of tiny files
  - name: Disks | disable lvm archives
    copy:
      src: lvm.conf
      dest: /etc/lvm/lvm.conf
    notify:
      - restart lvm2-lvmetad

  - name: Disks | expand vg with extra disks
    lvg:
      vg: '{{ volume_group }}'
      pvs: '{{ default_pvs }}'

  - name: Disks | expand the lvm
    lvol:
      vg: '{{ volume_group }}'
      lv: '{{ root_lv }}'
      size: 15g

  - name: Disks | grow fs for root
    filesystem:
      fstype: xfs
      dev: '{{ root_device }}'
      resizefs: yes

  - name: Disks | create srv lvm
    lvol:
      vg: '{{ volume_group }}'
      lv: '{{ srv_lv }}'
      size: 15g

  - name: Disks | format fs for srv
    filesystem:
      fstype: xfs
      dev: '{{ srv_device }}'
      resizefs: no

  - name: Disks | mount srv
    mount:
      name: '{{ srv_partition }}'
      src: '{{ srv_device }}'
      fstype: xfs
      state: mounted
      opts: 'defaults'

  ## This is a workaround for an XFS bug (it only grows if mounted)
  - name: Disks | grow fs for srv
    filesystem:
      fstype: xfs
      dev: '{{ srv_device }}'
      resizefs: yes

  ## Always check this, or it will try to do it each time
  - name: Disks | check if swap exists
    stat:
      path: '{{ swapfile }}'
      get_checksum: no
      get_md5: no
    register: swap

  - debug: var=swap.stat.exists

  - name: Disks | create swap lvm
    ## Shrink not supported until Ansible 2.2
    #lvol: vg=atomicos lv=swap size=2g shrink=no
    lvol:
      vg: atomicos
      lv: swap
      size: 2g

  - name: Disks | make swap file
    command: mkswap '{{ swapfile }}'
    when:
      - swap.stat.exists == false

  - name: Disks | add swap to fstab
    lineinfile:
      dest: /etc/fstab
      regexp: "^{{ swapfile }}"
      line: "{{ swapfile }}  none    swap    sw    0   0"

  - name: Disks | swapon
    command: swapon '{{ swapfile }}'
    when: ansible_swaptotal_mb < 1

  - name: Docker | setup docker-storage-setup
    lineinfile:
      dest: /etc/sysconfig/docker-storage-setup
      regexp: ^ROOT_SIZE=
      line: "ROOT_SIZE=15G"
    register: docker_storage_setup

  - name: Docker | setup docker-network
    lineinfile:
      dest: /etc/sysconfig/docker-network
      regexp: ^DOCKER_NETWORK_OPTIONS=
      line: >-
        DOCKER_NETWORK_OPTIONS='-H unix:///var/run/docker.sock
        -H tcp://0.0.0.0:2376
        --tlsverify
        --tlscacert=/etc/pki/tls/certs/ca.crt
        --tlscert=/etc/pki/tls/certs/host.crt
        --tlskey=/etc/pki/tls/private/host.key'

  - name: Docker | add CA certificate
    copy: 
      src: ca.crt 
      dest: /etc/pki/tls/certs/ca.crt 
      owner: root 
      group: root 
      mode: 0644

  - name: Admin Helpers | thinpool wiper script
    copy:
      src: wipe_docker_thinpool.sh
      dest: /usr/local/bin/wipe_docker_thinpool.sh
      mode: 0755

  - name: Journalctl | set journal sizes
    copy:
      src: journald.conf
      dest: /etc/systemd/journald.conf
      mode: 0644
    notify:
      - restart systemd-journald

  - name: Random Atomic Bugfixes | add lastlog
    file:
      path: /var/log/lastlog
      state: touch

  - name: Random Atomic Bugfixes | add root bashrc for prompt
    copy:
      src: root-bashrc
      dest: /root/.bashrc
      mode: 0644

  - name: Random Atomic Bugfixes | add root bash_profile for .bashrc
    copy:
      src: root-bash_profile
      dest: /root/.bash_profile
      mode: 0644

  ### Disable Cloud Init ###
  ## These are in Ansible 2.2, which we don't have yet
  - name: stop cloud-config
    systemd: name=cloud-config state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-init
    systemd: name=cloud-init state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-init-local
    systemd: name=cloud-init-local state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-final
    systemd: name=cloud-final state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: Remove old cloud-init files if they exist
    shell: rm -f /etc/init/cloud-*
    ignore_errors: yes
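
The role also has a couple of handlers that aren’t shown above; they’re nothing more exotic than service restarts, along these lines (a sketch):

---
  - name: restart lvm2-lvmetad
    service:
      name: lvm2-lvmetad
      state: restarted

  - name: restart systemd-journald
    service:
      name: systemd-journald
      state: restarted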

The only other tasks we run are related to $WORK-specific stuff (security office scanning, a patching user account for automated updates, etc.).

One of the beneficial side-effects of mixing cloud-init and Ansible is that cloud-init is only used for the initial setup (networking and root access), so the user-data file stays under the size limit Amazon Web Services imposes on user-data.  This allows us to create and maintain RHEL Atomic hosts in AWS using the exact same cloud-init user-data file and Ansible roles.

Using Docker and AWS to Survive an Outage

Last week at $WORK, we suffered from an outage that slowed down a large part of our network and took down our main website for both internal and external customers.  We were under a distributed denial of service attack focused on the website itself.  The site is load-balanced, and this resulted in slowdowns or outages for all the services behind the load balancers, as well.

While folks were bouncing ideas around on how to bring the site up again while still struggling with the outage, I mentioned that I could pretty quickly migrate the site over to Amazon Web Services and run it in Docker containers there. The higher-ups gave me the go-ahead and a credit card (very important, heh) and told me to get it set up.  The idea was to have it there so we could fail over to the cloud if we were unable to resolve the outage in a reasonable time.

TL;DR – I did, it was easy, and we failed over all external traffic to the cloud. Details below.

Amazon Web Services

Despite having a credit card and a pretty high blanket “OK”, I wanted to make sure we didn’t spend any money unless it was absolutely necessary. To that end, I created three of the “free tier” EC2 instances (1GB RAM, 1 CPU, 10GB Storage) rather than one or more larger instances. After all, these servers were going to be doing one thing and one thing only – running Docker. I took all the defaults, except two. First, I opted to use RHEL7 as the OS. We use Red Hat at work, so I’m familiar with it (and let’s be honest, it works really well), especially where setting up Docker comes in. Second, I set up a security group that allowed only HTTP/HTTPS traffic to the EC2 instances, and SSH access only from $WORK. Security groups are like a logical firewall, I guess – run by Amazon in front of the servers themselves.

The EC2 instances started almost immediately, and I logged in via SSH using the key pair I created for this project. The first thing I did was augment the security group by setting the IPTables firewall on the hosts themselves to match: SSH from $WORK only, drop everything else, even pings.  You know, just in case.
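
For the curious, the host-level rules amounted to something like this (a sketch, not our exact rules; the $WORK address range is a placeholder):

# keep loopback and established traffic, allow SSH from $WORK, drop the rest
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp -s 192.0.2.0/24 --dport 22 -j ACCEPT
iptables -P INPUT DROP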

Note: Since I was planning to use Docker to run the website, I didn’t need to add IPTables rules for HTTP/HTTPS. Docker uses the FORWARD chain, since it NATs from the host IP to the containers, and Docker has the ability to add and remove rules from the chain itself as needed.

Next, I ran a quick *yum update* to get the latest patches on the EC2 instance. It wasn’t terribly out of date, so this was quick.

Now to the meat of things. I didn’t really want to muck about with repos or try to find which one was required to install the Docker RPM, so I just copied the RPM for Docker from our local repository. The RPM is packaged upstream by Red Hat, and includes Docker 1.2.1. Even though I wanted to use Docker 1.4.1, the older RPM version is no big deal – I just installed it to get the basic config files – systemd service files, sysconfig, etc. Once the RPM was installed, I downloaded the Docker 1.4.1 binary from Docker.io, and replaced the 1.2.1 binary from the RPM. Presto! Latest Docker with the handy *docker exec* command! At this point, the server itself was basically done, and I moved on to setting up the Docker image.
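
Roughly, that dance looked like this (a sketch; the RPM filename and download URL are illustrative, not exact):

# install the RPM for its unit/config files, then swap in the newer binary
sudo yum -y localinstall docker-1.2.1-*.el7.x86_64.rpm
sudo systemctl stop docker
sudo curl -L -o /usr/bin/docker https://get.docker.com/builds/Linux/x86_64/docker-1.4.1
sudo chmod +x /usr/bin/docker
sudo systemctl enable docker
sudo systemctl start docker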

Time spent so far: About 5 minutes

Docker

Now, I didn’t have an image for our website ready to go or anything – I was going to have to build it from scratch.  However, I’ve been lucky enough to be allowed to play around with Docker at $WORK, and had already built some generic images for web stacks for our public DockerDemos project (https://dockerdemos.github.io/appstack/), so I was familiar with what I’d need to build the image for our site. I wrote a Dockerfile and built the image on my local laptop to test it. I went through a few revisions to get it perfect, but it only took about 15 minutes to write it from scratch. Once that was ready, I copied the Dockerfile and supporting files up to the EC2 servers, and built the images there. With the magic that is Docker and Linux containers, everything functioned exactly as it did on my laptop, and in a few seconds all three EC2 instances had the website image ready to go.
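
Mechanically, the copy-and-build step was as boring as it sounds, something like this (a sketch with placeholder paths and hostnames):

# from my laptop: copy the build context up to an EC2 instance
scp -i ~/.ssh/aws-key.pem -r website/ ec2-user@ec2-host-1:~/

# on the instance: build the image from the Dockerfile
cd ~/website && sudo docker build -t website .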

The final step was to run the container from the image. On all three of the EC2 instances, I ran:

docker run --name website -p 80:80 -p 443:443 -d website && \
docker logs -f website

The first command immediately started up the web servers inside the containers and started to sync their content, and the second opened up STDOUT inside the container so I could watch the progress. In a minute or two the sync was done, and the servers were online!

Note: The “sync” I’m talking about is part of how our website works, not something related to Docker itself.

Total time spent: About 25 minutes

So, in one fell swoop – about a half hour – I was able to create three servers running a Docker image to serve our main website, from scratch. It’s a good thing, too. It wasn’t long before we made the call to fail over, and currently all of our external traffic to the site is being served by these three containers.

That seems cool, no?  But check this out:

Sunday night, I needed to add more servers to the rotation. It was late. I was cranky to have been called after hours. I logged into AWS and used the EC2 “Create Image” feature to commit one of the running instances to a custom image (took about a minute). Then, I spun up three more EC2 instances from that image. They started up as quickly as a normal EC2 instance, and contained all the work I’d already done to set up the first servers, including the Docker package, binary, and image. Once they were up, all I had to do was run the *docker run* command again, and they were ready to go. Elapsed time?

2 minutes

It took longer for the 5 minute time-to-live on our DNS entry to expire.
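
If you’d rather not click through the console, those same two steps can be scripted with the AWS CLI, roughly like this (a sketch; all the IDs and names are placeholders):

# snapshot a running instance into a reusable AMI
aws ec2 create-image --instance-id i-0123456789abcdef0 --name website-base

# launch three more instances from that AMI
aws ec2 run-instances --image-id ami-0123456789abcdef0 --count 3 \
    --instance-type t2.micro --key-name website-key \
    --security-group-ids sg-0123456789abcdef0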

Docker is Awesome. With AWS, it’s Awesome-er. I’m trying to convince folks that we should leave all of our external traffic to be served by Docker in AWS, and to migrate more sites out there. At the very least, it’s extremely flexible and allows us to respond to issues on a whole different timescale than we could before.

Oh, and an added bonus? All of our external monitoring (in multiple sites across the country) report that our page load speeds have improved 3x compared to what they were on the servers hosted in-house with regular non-Docker setups. I’m investigating what is giving us that increase this week.

Oh, and a second added bonus? For the last five days, our bill from Amazon for hosting our main website is a whopping *$4.69*. That’s a cup of crappy venti mochachino soy caramel crumble arabian dark coffee (or whatever) at the local coffee chain. And I can do without the calories.

Update:

Well, it’s been six months since this little adventure took place.  Since then, this solution has worked so well that we left all of our external traffic pointing to these instances at AWS.  Arguably, things have gotten even easier the more I work with both AWS and Docker.  To that point:

  1. The Docker approach worked so well that we replaced all of the servers hosting our website internally with basic RHEL7 servers running Docker containers.  The servers are considerably more lightweight than they used to be, and as such we can get better performance out of them.
  2. I’ve since added the new(-ish) Docker flag --restart=always to the deploy command.  This saves me the step of even having to start the containers on reboot.
  3. I set up all the hosts to use the Docker API, and TLS authentication, so I can upload new images and start and stop containers on each host without even needing to log in to them.  (This required opening port 2376 to $WORK in the security group and host firewall; see the sketch after this list.)
  4. I wrote a couple of simple bash scripts to re-build the image as needed, and deploy locally for testing.  With the portable nature of Docker images, it’s extremely easy for me to test all changes before I push them out.
  5. Rotating the instances in and out of production at AWS is extremely simple with Amazon’s Elastic IP Addresses.  We are able to rotate a host out of service and instantly replace it with another, allowing us to patch them all with zero downtime.
  6. Amazon’s API is a wonderful thing.  I can manage the entire thing with some python scripts on my laptop, or the convenient Amazon CLI package.
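
As an example of that remote-API workflow from item 3, driving one of the hosts over TLS looks roughly like this (a sketch; the hostname and cert paths are placeholders):

# point the local docker client at a remote host, authenticating with TLS
export DOCKER_HOST=tcp://website-1.example.com:2376
export DOCKER_TLS_VERIFY=1
export DOCKER_CERT_PATH=~/.docker/website-certs    # holds ca.pem, cert.pem, key.pem

docker ps                      # list containers on the remote host
docker load -i website.tar     # upload a locally built image
docker restart website         # bounce the running container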

Docker and AWS have proven themselves to me, and, through this process, to the higher-ups at $WORK.  We’re embracing Docker whole-heartedly in our datacenters here, and we’ve moved a number of services to AWS now, as well.  The ease and flexibility of both is a boon to us, and to our clients, and it’s starting to transform the way we do things in IT – the way we do everything in IT.

How ’bout CoreOS as your Cloud base?

I’ve heard the name CoreOS around a little bit over the last two months or so, but it hadn’t really jumped out at me until last week when Mark McCahill mentioned it in a meeting. He’d read some pretty cool things about it: minimal OS, designed for running Docker containers, easy distributed configuration, default clustering and service discovery. In particular the use of etcd to manage data between clustered servers caught my eye – we’ve been struggling at $WORK with how to securely get particular types of data into Docker containers in a way that will scale out well if we ever need to start bringing up containers on hosts that don’t physically reside in our datacenters.

I haven’t even gotten into the meat of CoreOS yet, but just now, I accomplished a task that was surely lifted out of science fiction. With the assistance of Cobbler as a PXE server and a Docker container (what else!) as a quick host for a cloud-config file, I was able to install CoreOS in seconds and SSH into it with an SSH public key that I provided in the cloud-config file. I was legitimately shocked by how quick and easy it was.
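
The cloud-config file in question is almost embarrassingly small; mine was essentially this (a sketch with a placeholder key):

#cloud-config

# keys listed here end up on CoreOS's default "core" user
ssh_authorized_keys:
  - ssh-rsa AAAAB3Nza...EXAMPLE me@laptop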

Docker containers start instantly – it’s one of their best features. It allows us to ship them around with impunity and perform seamless maintenance. CoreOS hosts live in the same timescale, meaning we can PXE boot and configure new hosts, specifically designed for hosting Docker containers, in seconds, and from anywhere. CoreOS offers support for installing onto bare metal, and that would surely give you the best performance, but take a moment to comprehend the flexibility given to you by using virtual machines instead.

Make an API call to your VMware or Xen/KVM cluster in your local or remote datacenters to create and start a virtual machine. Or do the same thing with an Amazon host. Or Google. The VM PXE boots into CoreOS within seconds, joins its cluster, and begins getting data from the rest of the cluster. Within minutes, Docker images are downloaded and built, and containers are spinning up into production. It doesn’t get any more flexible than that. At this point scaling and disaster recovery are hindered only by your ability to produce applications that can handle it. It doesn’t matter if a particular Docker container is up or down; if something happens to it, you just bring up another somewhere else. Along those lines, it doesn’t matter if an entire host is up or down. That’s what it means to be in the same timescale. Containers and their hosts can be brought up and down with impunity, with no impact to the service.

Another benefit to abstracting CoreOS away from the bare metal is the freeing of ties to a particular technology. If you can design your systems to use the APIs of your local VM solution and the remote APIs from various cloud vendors, then you can move your services wherever you need them. As long as you can control the routing of traffic in some way (load balancers, DNS, Hipache, some of the cool new things being done by Cisco), and the DHCP PXE options for your host servers, then your services are effectively ephemeral and not tied to a particular location or vendor.

For now this is all still very beta, both Docker and CoreOS, but the promise being shown is real. Everyone, from the largest internet giants to the smallest one-room startups, will benefit from the coming revolution in computing.

"Cloud-style" Docker Demo Container

Completed a first pass at a minimal “Cloud-style” #Docker container. It’s sort of like an EC2 instance. You generate an ssh pem file, and pass the public key in as an environment variable at docker run:

sudo docker run -i -t -d -P \
-e PUBKEY="$(cat ~/.ssh/my.pem.pub)" cloudbase

You end up with a CentOS container, and a user “clouduser” that has sudo w/no password rights.
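
Assuming the image exposes SSH (that’s what the PUBKEY is for), connecting to the resulting container looks roughly like this (a sketch; the container ID and mapped port will differ):

# find which host port Docker mapped to the container's SSH port
sudo docker port <container-id> 22
# -> 0.0.0.0:49153

# then ssh in as clouduser with the matching private key
ssh -i ~/.ssh/my.pem -p 49153 clouduser@localhost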

I think this would be a good way to get some folks interested in Docker – perhaps offering something like this as a playground/sandbox to build interest.

Visit a website, get a Docker CentOS container!

Code: https://github.com/DockerDemos/CloudBase


I’m beginning to copy over my technology-related posts from Google+ to this blog, mostly so I have an easy-to-read record of them. This one was originally published on 19 May 2014: Cloud-style Docker Container