Ansible Role for RHEL Atomic Host

This morning I was asked by a friend if I could share any Ansible roles we use at $WORK for our Red Hat Atomic Host servers.  It was a relatively easy task to review and sanitize our configs – Atomic Hosts are so minimal, there’s almost nothing we have to do to configure them.

When our Atomic hosts are initially created, they’re minimally configured via cloud-init to set up networking and add a root user ssh key.  (We have a VMware environment, so we use the RHEL Atomic .ova provided by Red Hat, and mount an ISO with the ‘user-data’ and ‘meta-data’ files to be read by cloud-init.)  Once that’s done, we run Ansible tasks from a central server to set up the rest of the Atomic host.
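
For reference, the cloud-init side really is minimal.  A rough sketch of the kind of ‘user-data’ that goes on that ISO might look like the following – the hostname and key are placeholders, and the networking configuration is omitted here:

#cloud-config
# Illustrative only: allow root logins and drop in the key Ansible will use
hostname: atomic01
disable_root: false
users:
  - name: root
    ssh-authorized-keys:
      - ssh-rsa AAAAB3Nza...EXAMPLE ansible@control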

Below is a snippet showing most of the playbook.

I think the variables are self-explanatory.   Some notes are added to explain why we’re doing a particular thing.  The disk partitioning is explained in more detail in a previous post of mine.
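
For context, the role’s variables might look something like this – the values below are illustrative placeholders (and the swapfile path is a guess based on the swap LV created later in the playbook), not our real settings:

# defaults/main.yml (illustrative values only)
root_password: "{{ vault_root_password }}"   # pre-hashed (crypt) value, kept in Vault
ssh_users:
  - name: root
    key: "ssh-rsa AAAAB3Nza...EXAMPLE ansible@control"
volume_group: atomicos
default_pvs:
  - /dev/sda1
  - /dev/sdb
root_lv: root
root_device: /dev/mapper/atomicos-root
srv_lv: srv
srv_device: /dev/mapper/atomicos-srv
srv_partition: /var/srv
swapfile: /dev/atomicos/swap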

---
  # Set w/Ansible because cloud-init is plain text
  - name: Access | set root password
    user: 
      name: root
      password: "{{ root_password }}"

  - name: Access | add ssh user keys
    authorized_key:
      user: "{{ item.name }}"
      key: "{{ item.key }}"
    with_items: "{{ ssh_users }}"

  - name: Access | root access to cron
    lineinfile:
      dest: /etc/security/access.conf
      line: "+:root:cron crond"

  - name: Access | fail closed
    lineinfile:
      dest: /etc/security/access.conf
      line: "-:ALL:ALL"

  # docker-storage-setup service re-configures LVM
  # EVERY TIME Docker service starts, eventually
  # filling up disk with millions of tiny files
  - name: Disks | disable lvm archives
    copy:
      src: lvm.conf
      dest: /etc/lvm/lvm.conf
    notify:
      - restart lvm2-lvmetad

  - name: Disks | expand vg with extra disks
    lvg:
      vg: '{{ volume_group }}'
      pvs: '{{ default_pvs }}'

  - name: Disks | expand the lvm
    lvol:
      vg: '{{ volume_group }}'
      lv: '{{ root_lv }}'
      size: 15g

  - name: Disks | grow fs for root
    filesystem:
      fstype: xfs
      dev: '{{ root_device }}'
      resizefs: yes

  - name: Disks | create srv lvm
    lvol:
      vg: '{{ volume_group }}'
      lv: '{{ srv_lv }}'
      size: 15g

  - name: Disks | format fs for srv
    filesystem:
      fstype: xfs
      dev: '{{ srv_device }}'
      resizefs: no

  - name: Disks | mount srv
    mount:
      name: '{{ srv_partition }}'
      src: '{{ srv_device }}'
      fstype: xfs
      state: mounted
      opts: 'defaults'

  ## This is a workaround for XFS bug (only grows if mounted)
  - name: Disks | grow fs for srv
    filesystem:
      fstype: xfs
      dev: '{{ srv_device }}'
      resizefs: yes

  ## Always check this, or it will try to do it each time
  - name: Disks | check if swap exists
    stat:
      path: '{{ swapfile }}'
      get_checksum: no
      get_md5: no
    register: swap

  - debug: var=swap.stat.exists

  - name: Disks | create swap lvm
    ## Shrink not supported until 2.2
    #lvol: vg=atomicos lv=swap size=2g shrink=no
    lvol:
      vg: atomicos
      lv: swap
      size: 2g

  - name: Disks | make swap file
    command: mkswap '{{ swapfile }}'
    when:
      - swap.stat.exists == false

  - name: Disks | add swap to fstab
    lineinfile:
      dest: /etc/fstab
      regexp: "^{{ swapfile }}"
      line: "{{ swapfile }}  none    swap    sw    0   0"

  - name: Disks | swapon
    command: swapon '{{ swapfile }}'
    when: ansible_swaptotal_mb < 1

  - name: Docker | setup docker-storage-setup
    lineinfile:
      dest: /etc/sysconfig/docker-storage-setup
      regexp: ^ROOT_SIZE=
      line: "ROOT_SIZE=15G"
    register: docker_storage_setup

  - name: Docker | setup docker-network
    lineinfile:
      dest: /etc/sysconfig/docker-network
      regexp: ^DOCKER_NETWORK_OPTIONS=
      line: >-
        DOCKER_NETWORK_OPTIONS="-H unix:///var/run/docker.sock
        -H tcp://0.0.0.0:2376
        --tlsverify
        --tlscacert=/etc/pki/tls/certs/ca.crt
        --tlscert=/etc/pki/tls/certs/host.crt
        --tlskey=/etc/pki/tls/private/host.key"

  - name: add CA certificate
    copy: 
      src: ca.crt 
      dest: /etc/pki/tls/certs/ca.crt 
      owner: root 
      group: root 
      mode: 0644

  - name: Admin Helpers | thinpool wiper script
    copy:
      src: wipe_docker_thinpool.sh
      dest: /usr/local/bin/wipe_docker_thinpool.sh
      mode: 0755

  - name: Journalctl | set journal sizes
    copy:
      src: journald.conf
      dest: /etc/systemd/journald.conf
      mode: 0644
    notify:
      - restart systemd-journald

  - name: Random Atomic Bugfixes | add lastlog
    file:
      path: /var/log/lastlog
      state: touch

  - name: Random Atomic Bugfixes | add root bashrc for prompt
    copy:
      src: root-bashrc
      dest: /root/.bashrc
      mode: 0644

  - name: Random Atomic Bugfixes | add root bash_profile for .bashrc
    copy:
      src: root-bash_profile
      dest: /root/.bash_profile
      mode: 0644

  ### Disable Cloud Init ###
  ## These are in Ansible 2.2, which we don't have yet
  - name: stop cloud-config
    systemd: name=cloud-config state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-init
    systemd: name=cloud-init state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-init-local
    systemd: name=cloud-init-local state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-final
    systemd: name=cloud-final state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: Remove old cloud-init files if they exist
    shell: rm -f /etc/init/cloud-*
    ignore_errors: yes
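
The notify directives above assume a couple of handlers defined elsewhere in the role.  They aren’t part of the snippet, but a minimal sketch (handler names taken from the notify lines, service names assumed) would be:

  # handlers/main.yml
  - name: restart lvm2-lvmetad
    service:
      name: lvm2-lvmetad
      state: restarted

  - name: restart systemd-journald
    service:
      name: systemd-journald
      state: restarted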

The only other tasks we run are related to $WORK-specific stuff (security office scanning, a patching user account for automated updates, etc.).

One of the beneficial side effects of mixing cloud-init and Ansible is that cloud-init is used only for the initial setup (networking and root access), so it stays under the size limit Amazon Web Services imposes on user-data files.  This allows us to create and maintain RHEL Atomic hosts in AWS using the exact same cloud-init user-data file and Ansible roles.

[UPDATE] Disk partition scheme for RHEL Atomic Hosts

Back in February of this year, I wrote a short piece about the partition scheme I was considering for our RHEL Atomic Host systems at work.  Six months on, I’ve got some real-world experience with the scheme – and it’s time to re-think some things.


Most of the assumptions from the last outing in this area remain valid – i.e., Atomic hosts using a versioned filesystem tree (OSTree), the recommended practice of using direct-lvm storage for containers and images on production systems, and the need for persistent data.  Really, all that has changed is a better understanding of how much storage is needed and how it should be allocated, based on our usage in production over the last six months.

The biggest incorrect assumption from back in the golden days of blissful pre-production was the assertion that the root partition would only need 6G of storage by default.  At the time, the Atomic hosts were shipping with a little less than 3G of packages and whatnot in the OSTree, so I reasoned that double that amount would be fine, allowing us to download a new tree and reboot into it.  Since the majority of the filesystem is read-only and most of the activity was going to occur in the container Thin Pool storage or the persistent data partition, that’s all I thought we’d need.

That, it turns out, was a naive assumption.  The OSTree used by the Atomic hosts is larger now, and that in and of itself would be enough to tip the scales, but we’ve also had problems with a lack of log space for host logs, and container files that aren’t necessarily stored in the Thin Pool (their own logs, for a start*).

* Note: In the latest versions of Atomic, the default logging driver for containers is the host’s journald, so individual container log files are no longer an issue – but the point stands, since those logs now land in the host journals instead.

I also assumed that the 4G LVM Thin Pool allocation was enough, since it could be expanded as needed.  At the time, most of our containerized services were small, but we quickly started deploying larger services, and it seemed like the Ops guys were being paged every day to add disks to our thin pools.

The only thing I really got dead-on was the persistent storage.  Our services VERY rarely need 15G, and they fit comfortably, but not overly spaciously, in that space.  In the original scheme, though, I put this into its own Volume Group, which ended up making it less convenient to expand the storage.  Being in its own VG prevents us from adding a single disk and expanding both persistent storage and the root or Thin Pool allocation.  This led to a ridiculous number of relatively small virtual disks attached to each system.

Finally, a wider, department-wide decision to increase the default storage size of all new virtual machines from 25G to 50G removed the need to justify using larger disks, and let me design a scheme that makes use of the new default size.

The Partition Scheme

That experience has led to our Partition Scheme v2, which makes better use of LVM and is less concerned with the physical disks:

Physical Disks

50G in total

  • /dev/sda (10G)
  • /dev/sdb (40G)

Partitioning

One single Volume Group – “atomicos” (the default on RHEL Atomic hosts out of the box) – with three logical volumes:

  • 15G atomicos-root ( / )
  • 15G atomicos-srv ( /var/srv, for persistent data)
  • Thin-provisioned atomicos-docker-pool (LVM Thin Pool)

I’m still using the Atomic vSphere image, and as before, the disk size within the image is 10G – which is where /dev/sda comes from.  It’s easy enough to add an additional 40G disk and use it to expand the default “atomicos” Volume Group to 50G.

The Method

Ansible supplants Docker-Storage-Setup

I initially used the docker-storage-setup tool to modify the size of the root partition and configure the Thin Pool.  I was focused on using cloud-init for all of the host configuration, and this was the easiest method.  Now, however, I’ve built out an Ansible infrastructure to do the initial configuration of our Atomic hosts, and use cloud-init only to pass in the SSH keys used to run Ansible.  This ended up being much more convenient, as we could re-run the playbooks to update the hosts’ configurations as needed.

The out-of-the-box disk configuration for RHEL Atomic Host takes care of the thin pool setup, so we only need to add /dev/sdb to the VG, and create/expand/format the LVM partitions.  This is easily accomplished with just a few lines of code:

 

  ## Disk Config
 - name: expand vg with extra disks
   lvg: vg=atomicos pvs=/dev/sda1,/dev/sdb

 - name: expand the lvm
   lvol: vg=atomicos lv=root size=15g

 - name: grow fs for root
   filesystem: fstype=xfs dev=/dev/mapper/atomicos-root resizefs=yes

 - name: create srv lvm
   lvol: vg=atomicos lv=srv size=15g

 - name: format fs for srv
   filesystem: fstype=xfs dev=/dev/mapper/atomicos-srv resizefs=no

 - name: mount srv
   mount: name=/var/srv src=/dev/mapper/atomicos-srv fstype=xfs state=mounted opts='defaults'

 ## This is a workaround for XFS bug (only grows if mounted)
 - name: grow fs for srv
   filesystem: fstype=xfs dev=/dev/mapper/atomicos-srv resizefs=yes

 

Back to the Future (and back again)

Plus ça change, plus c’est la même chose (“the more things change, the more they stay the same”)
 Jean-Baptiste Alphonse Karr

This is the plan for RHEL Atomic hosts for the near future.  At the moment, our services are being deployed on individual, small-sized hosts and managed by talking directly to the remote Docker daemon’s API.  We’re using an orchestration tool we developed in-house early on in our container journey.

However… Orchestration is King now.  Containers are hard to work with as a human being, once you get to complexity or scale, and a variety of orchestration tools have come into their own in the last few years.  And orchestration naturally lends itself to clustering.  And clustering naturally lends itself to GINORMOUS servers running lots of services.

Everything old is new again, and it’s quite possible that in the near future we’ll be dealing with a few hundred clustered servers managed by some more standardized orchestration tool.  In that case, it’s likely that much of this partitioning becomes less and less important, and storage can be allocated more efficiently.  We’d still make use of a small 15-ish G root partition, but the thin pools and persistent storage would be considerably larger.

Or, does that even work that way at scale?  If the container images share layers, then at scale, each container’s images would be a much smaller fraction of the total.  100 containers sharing 90% of the layers could still fit into a small-ish size.  Perhaps at scale, the thin pool would be only a dozen or so gigabytes larger in size.
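
To put rough, purely illustrative numbers on it: if each container’s image is 1G but 90% of that is shared base layers, 100 containers need roughly 0.9G stored once plus 100 × 0.1G of unique layers – about 11G of image storage rather than 100G.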

Persistent storage ends up becoming a more important matter, and less and less likely to exist on the host at scale.  This would be the time to explore NFS mounts, or Ceph storage, and remove the persistence from the host entirely.  And realistically, with Gluster or Ceph storage drivers for your container engine, even the Thin Pool may not be necessary.  Are we looking at 25G storage attached to 100GB RAM systems managed by OpenShift/Kubernetes in our near future?  It seems likely.

Like its predecessor, v2 of our partition scheme is likely to change.

 

 

Disk partition scheme for RHEL Atomic Hosts

I’ve been working on what will likely be the production disk partition system for our RHEL Atomic Host systems at work.  There’s a bit of a balancing act to this setup, with three things to take into consideration.


First, since these are Atomic hosts, most of the system is made up of a versioned filesystem tree (OSTree).  The OSTree manages all the packages on the system and so there is not much need to mess with the root partition.  It does not take up much space by default – about 1.6 G with the current and last OSTree.

Second, Atomic hosts are designed to run Docker containers.  Docker recommends using direct-lvm storage on production systems.  An LVM thin pool is created directly on block devices and used to store the image layers.  Each layer is a snapshot created from its parent image, and container layers are snapshots of their parent images as well.  Some free space is needed with which to create this thin pool.

Finally, for many services hosted in containers, there has to be a way to store persistent data.  What is considered persistent data varies by the type of service.  Consider, for example, user-uploaded content for a Drupal website, or custom configuration files telling a proxy server how it works, or database data files.  This persistent data needs to live somewhere.

The Partition Scheme

Given all this, it seems the best partition scheme for our use is the following:

/dev/sda:

  • /dev/sda1 – / (6G)
  • LVM Thin Pool – /var/lib/docker (4G †)

/dev/sdb‡:

  • /dev/sdb1 – /var/srv (symlinked to /srv in Atomic, 15G †)

† sizes of these disks could be expanded as needed
‡ /dev/sdb could be replaced with an NFS mount at /var/srv

Our environment is based on the Atomic vSphere image, and new Atomic hosts are created from this image.  The disk size within the image is 10G, which is where the size of /dev/sda comes from.  This could be expanded using vmkfstools before the VM is powered on, if needed.  In practice, however, 10G covers a lot of the minor services that are deployed, and if more space is needed, the LVM pool can be expanded onto another disk while the system is online, providing more space for images.

The default size of the root partition in Atomic is 3G.  With two OSTrees installed, almost half of that is used up.  It’s useful to expand this to provide some headroom to store the last tree and some logs and incidental data.

Docker-Storage-Setup

Luckily, a helper tool, docker-storage-setup, is included in the docker rpm to not only expand the root partition, but also set up the thin pool and configure Docker to use direct-lvm.  Docker-storage-setup is a service that runs prior to the Docker service.  To expand the root size to 6G, add the following to /etc/sysconfig/docker-storage-setup:

# /etc/sysconfig/docker-storage-setup
ROOT_SIZE=6G

This file is read by docker-storage-setup each time it runs.  It can be used to specify the default root size, which block devices or volume groups are to be included in the thin pool, how much space is reserved for data and metadata in the thin pool, and so on.

(More information about these options can be found in /usr/bin/docker-storage-setup.)

By only setting ROOT_SIZE, docker-storage-setup is allowed to expand the root partition to 6G, and use the rest of /dev/sda for the thin pool.
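
Other settings can be combined in the same file.  As an illustrative sketch only – check /usr/bin/docker-storage-setup for the options your version actually supports:

# /etc/sysconfig/docker-storage-setup
ROOT_SIZE=6G       # grow the root LV to 6G
DEVS=/dev/sdb      # extra block device(s) to add to the volume group
VG=atomicos        # volume group to use for the thin pool
DATA_SIZE=40%FREE  # portion of free space to hand to the thin pool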

Persistent Data

Persistent data is special.  It is arguably the only important data on the entire host.  The host itself is completely throwaway; a new one can be spun up, configured, and put into service in less than 10 minutes.  Atomic hosts are designed for nothing more in life than hosting containers.

Images and containers are similarly unimportant.  New images can be pulled from a registry in minutes or seconds, and they contain immutable data in any case.

Containers could be considered more important, but if their ephemeral nature is preserved – i.e., nothing important goes into a container, and all persistent data is mounted in or stored elsewhere – then they, too, are truly unimportant.

So the persistent data lives on another physical disk, and is mounted as a volume into the Docker containers.  It could go somewhere in the root partition, but since the root partition is managed by the OSTree, it’s essentially generic and disposable.  By mounting a dedicated disk for persistent data, we can treat it separately from the rest of the system.

We use the second physical disk so we can then move the disk around to any other Atomic host and the service can be immediately available on the new host.  We can rip out a damaged or compromised root partition and attach the persistent data disk to a fresh install within a few minutes.  Effectively, the persistent data is completely divorced from the host.

The second physical disk can also be left out completely, and an NFS share (or other file store) mounted in its place, allowing for load-balancing and automatic scaling.  The NFS share makes it possible to present the data to customers without giving them access to the host directly.
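
If we went the NFS route, the mount looks much like the srv mount shown in the Ansible tasks above.  A rough sketch – the server name and export path are made up:

 - name: mount srv from NFS
   mount: name=/var/srv src=nfs.example.com:/exports/atomic-srv fstype=nfs state=mounted opts='defaults'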

LVM for Change

No battle plan ever survives contact with the enemy.
Helmuth von Moltke the Elder

As always happens, things change.  What works now may not work in a year.  The root filesystem and Docker image thin pools are created with LVM by Atomic, allowing us to expand them easily as necessary.  The second physical disk is given its own volume group and logical volume, so it can also be expanded easily if we run out of space for persistent data.  Every part of the Atomic host uses LVM – it’s a key to making the whole system extremely flexible.

A Word of Caution

So far the system is relatively painless to use, with a single exception: measuring the data usage of the thin pool.  It is important to keep track of how much free space is left in the thin pool for both the data and the metadata.  According to Red Hat:

If the LVM thin pool runs out of space it will lead to a failure because the XFS file system underlying the LVM thin pool will be retrying indefinitely in response to any I/O errors.

You should be able to see the amount of space used by the thin pool with the `lvs` command.  However, with the systems I’ve tried (both Atomic and standard RHEL7), the data is left blank:

[Screenshot: lvs output with the thin pool data and metadata usage columns blank]

I have not yet been able to figure out why this is the case. As a workaround, though, `docker info` can be used to gather the information.  Note the “Data Space Used” and “Metadata Space Used” in the image below.

[Screenshot: docker info output showing “Data Space Used” and “Metadata Space Used”]
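
Until the blank lvs output is sorted out, a small Ansible check can pull the same numbers from docker info for monitoring.  This is only a sketch of the idea, not part of our role:

 - name: check thin pool usage via docker info
   shell: docker info 2>/dev/null | grep 'Space Used'
   register: thinpool_usage
   changed_when: false

 - debug: var=thinpool_usage.stdout_lines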