Disk partition scheme for RHEL Atomic Hosts

I’ve been working on what will likely be the production disk partition scheme for our RHEL Atomic Host systems at work.  There’s a bit of a balancing act to this setup, with three things to take into consideration.


First, since these are Atomic hosts, most of the system is made up of a versioned filesystem tree (OSTree).  The OSTree manages all the packages on the system, so there is not much need to mess with the root partition.  It does not take up much space by default – about 1.6 GB with the current and previous OSTrees.

Second, Atomic hosts are designed to run Docker containers.  Docker recommends using direct-lvm on production systems: an LVM thin pool is created directly on block devices and used to store the image layers.  Each image layer is a thin snapshot of its parent layer, and container layers are in turn snapshots of the images they run.  Some free space is needed with which to create this thin pool.
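For illustration, direct-lvm boils down to something like the sketch below.  The helper tool covered later automates all of this; the volume group name and pool name here are assumptions for the example, not what any particular host will have.

# Sketch only: carve a thin pool for Docker out of an existing volume
# group. Assumes a VG named "atomicos" with free extents.
lvcreate --wipesignatures y -l 60%FREE --thinpool docker-pool atomicos

# Then point Docker's devicemapper driver at the pool, e.g. via
# /etc/sysconfig/docker-storage:
#   DOCKER_STORAGE_OPTIONS="--storage-opt dm.thinpooldev=/dev/mapper/atomicos-docker--pool"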

Finally, for many services hosted in containers, there has to be a way to store persistent data.  What counts as persistent data varies by the type of service: consider user-uploaded content for a Drupal website, custom configuration files that tell a proxy server how to behave, or database data files.  This persistent data needs to live somewhere.

The Partition Scheme

Given all this, it seems the best partition scheme for our use is the following:

/dev/sda:

  • /dev/sda1 – / (6G)
  • LVM Thin Pool – /var/lib/docker (4G †)

/dev/sdb‡:

  • /dev/sdb1 – /var/srv (/srv is a symlink to /var/srv in Atomic; 15G †)

† the sizes of these volumes can be expanded as needed
‡ /dev/sdb could be replaced with an NFS mount at /var/srv

Our environment is based on the Atomic vSphere image, and new Atomic hosts are created from this image.  The disk inside the image is 10G, which is where the size of /dev/sda comes from.  This could be expanded using vmkfstools before the VM is powered on, if needed.  In practice, however, 10G covers most of the minor services that are deployed, and if more space is needed, the LVM pool can be expanded onto another disk while the system is online, providing more space for images.
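For example, both operations might look like this (the datastore path, device, and VG/LV names are illustrative; check yours with vgs and lvs):

# Before first power-on: grow the VMDK on the ESXi host.
vmkfstools -X 20G /vmfs/volumes/datastore1/atomic01/atomic01.vmdk

# Later, with the system online: add a new disk to the volume group and
# grow the thin pool into it.
pvcreate /dev/sdc
vgextend atomicos /dev/sdc
lvextend -l +100%FREE atomicos/docker-pool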

The default size of the root partition in Atomic is 3G.  With two OSTrees installed, almost half of that is used up.  It’s useful to expand this to provide some headroom for the previous tree, logs, and incidental data.

Docker-Storage-Setup

Luckily, a helper tool, docker-storage-setup, is included in the docker RPM to not only expand the root partition but also set up the thin pool and configure Docker to use direct-lvm.  docker-storage-setup runs as a service prior to the Docker service.  To expand the root size to 6G, add the following to /etc/sysconfig/docker-storage-setup.

# /etc/sysconfig/docker-storage-setup
ROOT_SIZE=6G

This file is read by docker-storage-setup each time it runs.  It can be used to specify the desired root size, which block devices or volume groups are to be included in the thin pool, how much space is reserved for data and metadata in the thin pool, and so on.

(More information about these options can be found in /usr/bin/docker-storage-setup.)
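As a sketch, a fuller configuration might look like this; the option names come from the script itself, but the values are purely illustrative:

# /etc/sysconfig/docker-storage-setup -- illustrative values only
ROOT_SIZE=6G         # grow the root logical volume to this size
DEVS=/dev/sdc        # extra block devices to add to the volume group
VG=atomicos          # volume group that will hold the thin pool
DATA_SIZE=60%FREE    # how much of the remaining space goes to pool data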

By setting only ROOT_SIZE, docker-storage-setup expands the root partition to 6G and uses the rest of /dev/sda for the thin pool.
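Once the service has run, the result can be sanity-checked with standard tools (on RHEL Atomic the volume group is typically atomicos and the pool shows up as docker-pool, but your names may differ):

lvs        # expect the root LV at 6G plus a docker-pool thin pool LV
df -h /    # the root filesystem should reflect the new size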

Persistent Data

Persistent data is special.  It is arguably the only important data on the entire host.  The host itself is completely throw-away; a new one can be spun up, configured, and put into service in less than 10 minutes.  Atomic hosts are designed for nothing more in life than hosting containers.

Images are similarly unimportant.  New images can be pulled from a registry in minutes or seconds, and they contain immutable data in any case.

Containers could be considered more important, but if their ephemeral nature is preserved (that is, nothing important goes into a container, and all persistent data is mounted in or stored elsewhere), then they, too, are truly unimportant.

So the persistent data lives on another physical disk, and is mounted as a volume into the Docker containers.  It could go somewhere in the root partition, but since the root partition is managed by the OSTree, it’s essentially generic and disposable.  By mounting a dedicated disk for persistent data, we can treat it separately from the rest of the system.
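Mounting that data into a container is then just a volume bind.  A hypothetical Drupal container, with the image name and paths invented for the example, might be started like this:

# Bind-mount persistent data from the dedicated disk into the container.
# Image name and container path are hypothetical.
docker run -d --name drupal \
  -v /var/srv/drupal/files:/var/www/html/sites/default/files \
  registry.example.com/drupal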

We use a second physical disk so we can move that disk to any other Atomic host and have the service immediately available on the new host.  We can rip out a damaged or compromised root partition and attach the persistent-data disk to a fresh install within a few minutes.  Effectively, the persistent data is completely divorced from the host.

The second physical disk can also be left out completely, and an NFS share (or other file store) mounted in its place, allowing for load balancing and automatic scaling.  The NFS share makes it possible to present the data to customers without giving them access to the host directly.
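A minimal sketch of the NFS variant, assuming a hypothetical server and export path, is a single /etc/fstab entry:

# /etc/fstab -- server name and export path are examples
nfs.example.com:/exports/atomic-srv  /var/srv  nfs  defaults  0 0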

LVM for Change

No battle plan ever survives contact with the enemy.
– Helmuth von Moltke the Elder

As always happens, things change.  What works now may not work in a year.  The root filesystem and Docker image thin pool are created with LVM by Atomic, allowing us to expand them easily as necessary.  The second physical disk is given its own volume group and logical volume, so it can also be expanded easily if we run out of space for persistent data.  Every part of the Atomic host uses LVM – it’s key to making the whole system extremely flexible.
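Setting up the second disk that way is only a few commands; the VG and LV names below are examples:

# Give the persistent-data disk its own volume group and logical volume.
pvcreate /dev/sdb
vgcreate vg_srv /dev/sdb
lvcreate -l 100%FREE -n lv_srv vg_srv
mkfs.xfs /dev/vg_srv/lv_srv
echo '/dev/vg_srv/lv_srv /var/srv xfs defaults 0 0' >> /etc/fstab
mount /var/srv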

A Word of Caution

So far the system is relatively painless to use, with a single exception: measuring the data usage of the thin pool.  It is important to keep track of how much free space is left in the thin pool, for both the data and the metadata.  According to Red Hat:

If the LVM thin pool runs out of space it will lead to a failure because the XFS file system underlying the LVM thin pool will be retrying indefinitely in response to any I/O errors.

You should be able to see the amount of space used by the thin pool with the `lvs` command.  However, with the systems I’ve tried (both Atomic and standard RHEL7), the usage columns are left blank:

[Screenshot: `lvs` output with the pool usage columns blank]

I have not yet been able to figure out why this is the case. As a workaround, though, `docker info` can be used to gather the information.  Note the “Data Space Used” and “Metadata Space Used” in the image below.

[Screenshot: `docker info` output showing the Data Space Used and Metadata Space Used fields]
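A quick way to pull just those numbers out of `docker info` (the field names are as printed by the devicemapper driver):

docker info 2>/dev/null | grep -E 'Space (Used|Total|Available)'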


3 thoughts on “Disk partition scheme for RHEL Atomic Hosts”

  1. Hi Chris,

    Thanks for this article. As I’m trying to set up such a host using CentOS Atomic, I’m facing a partition problem.

    I’ve got a single physical drive, large enough for my needs (240GB, SSD). I want it to be split in two:
    • a dedicated partition for the container data (persistent data)
    • the rest of the disk to host the OSTree

    I’m unable to set this up. No matter how I try to configure it, it ends with an error, and not always the same one… I’ve tried a lot of different configurations with Anaconda, but wasn’t able to find something that just works.

    Do you have any clue?


  2. Hey, nice one. I have a question on Atomic images: how do you create new images, and how do you add customizations to them, for example partitions, DNS, NTP, authentication?

    How do you update your Atomic hosts, and does it preserve all customizations?


    1. We’re using Red Hat’s provided image for vSphere for our Atomic hosts right now.  We have not yet upgraded to Satellite 6, and Satellite 5 doesn’t support OSTree yet, so the Anaconda installer wasn’t doing what we needed in our environment, but you could go that route if it works for you (ex: http://www.projectatomic.io/docs/fedora_atomic_bare_metal_installation/)

      Parts of the Atomic host filesystem are writable, including /etc, so you can make configuration changes the same as you would on any other host, and updates preserve those customizations.

      Upgrading an Atomic host is done with the `atomic host upgrade` command, which downloads the new tree and reboots into it, similar to CoreOS. More info about that can be found here: http://www.projectatomic.io/docs/os-updates/

      The `atomic` command is really just a wrapper around other commands, so really you’re just doing an `rpm-ostree upgrade`.

