I’ve been working on what will likely be the production disk partition system for our RHEL Atomic Host systems at work. There’s a bit of a balancing act to this setup, with three things to take into consideration.
First, since these are Atomic hosts, most of the system is made up of a versioned filesystem tree (OSTree). OSTree manages all the packages on the system, so there is not much need to mess with the root partition. It does not take up much space by default – about 1.6G with the current and previous trees installed.
Second, Atomic hosts are designed to run Docker containers. Docker recommends using direct-lvm on production systems: an LVM thin pool is created directly on block devices and used to store the image layers. Each image layer is a snapshot of its parent layer, and container layers are likewise snapshots of their image’s topmost layer. Some free space is needed with which to create this thin pool.
Finally, for many services hosted in containers, there has to be a way to store persistent data. What is considered persistent data varies by the type of service. Consider, for example, user-uploaded content for a Drupal website, or custom configuration files telling a proxy server how it works, or database data files. This persistent data needs to live somewhere.
The Partition Scheme
Given all this, it seems the best partition scheme for our use is the following:
- /dev/sda1 – / (6G)
- LVM Thin Pool – /var/lib/docker (4G †)
- /dev/sdb1 – /var/srv (symlinked to /srv in Atomic, 15G †)
† sizes of these disks could be expanded as needed
‡ /dev/sdb could be replaced with an NFS mount at /var/srv
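Setting up the second disk is ordinary LVM work. A minimal sketch, assuming hypothetical names `vg_data`/`lv_data` (substitute your own, and partition the disk first if you prefer a PV on /dev/sdb1):

```shell
# Give the persistent-data disk its own volume group and logical volume,
# then put an XFS filesystem on it and mount it at /var/srv.
pvcreate /dev/sdb
vgcreate vg_data /dev/sdb
lvcreate -l 100%FREE -n lv_data vg_data
mkfs.xfs /dev/vg_data/lv_data
mkdir -p /var/srv
echo '/dev/vg_data/lv_data /var/srv xfs defaults 0 0' >> /etc/fstab
mount /var/srv
```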
Our environment is based on the Atomic vSphere image and new Atomic hosts are created from this image. The disk size within the image is 10G, which is where the size of /dev/sda comes from. This could be expanded using vmkfstools before the VM is powered on, if needed. In practice, however, 10G covers most of the minor services we deploy, and if more space is needed, the LVM pool can be expanded onto another disk while the system is online, providing more space for images.
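Growing the virtual disk before first power-on is a one-liner on the ESXi host; the datastore path below is hypothetical:

```shell
# Extend the (not yet powered-on) VM's disk to 20G.
vmkfstools -X 20G /vmfs/volumes/datastore1/atomic-host/atomic-host.vmdk
```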
The default size of the root partition in Atomic is 3G. With two OSTrees installed, almost half of that is used up. It’s useful to expand this to provide some headroom for additional trees, logs, and incidental data.
Luckily a helper tool, docker-storage-setup, is included in the docker rpm; it not only expands the root partition, but also sets up the thin pool and configures Docker to use direct-lvm. docker-storage-setup is a service that runs prior to the Docker service. To expand the root size to 6G, add the following to /etc/sysconfig/docker-storage-setup:
```
# /etc/sysconfig/docker-storage-setup
ROOT_SIZE=6G
```
This file is read by docker-storage-setup each time it runs. It can be used to specify the root size, which block devices or volume groups to include in the thin pool, how much space to reserve for data and metadata in the thin pool, and so on.
(More information about these options can be found in /usr/bin/docker-storage-setup.)
With only ROOT_SIZE set, docker-storage-setup expands the root partition to 6G and uses the rest of /dev/sda for the thin pool.
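For reference, the storage options docker-storage-setup hands to the Docker daemon land in /etc/sysconfig/docker-storage. The result looks something like this (the exact pool name depends on the volume group, so treat this as illustrative):

```
# /etc/sysconfig/docker-storage (written by docker-storage-setup)
DOCKER_STORAGE_OPTIONS=--storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/atomicos-docker--pool
```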
Persistent data is special. It is arguably the only important data on the entire host. The host itself is completely throw-away; a new one can be spun up, configured and put into service in less than 10 minutes. They are designed for nothing more in life than hosting containers.
Images and containers are similarly unimportant. New images can be pulled from a registry in minutes or seconds, and they contain immutable data in any case.
Containers could be considered more important, but if their ephemeral nature is preserved – i.e. nothing important goes into a container, and all persistent data is mounted in or stored elsewhere – then they, too, are truly unimportant.
So the persistent data lives on another physical disk, and is mounted as a volume into the Docker containers. It could go somewhere in the root partition, but since the root partition is managed by the OSTree, it’s essentially generic and disposable. By mounting a dedicated disk for persistent data, we can treat it separately from the rest of the system.
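Mounting that data into a container is the standard bind-mount pattern; the image name and paths below are hypothetical:

```shell
# Mount a directory from the persistent-data disk into the container.
docker run -d --name drupal \
  -v /var/srv/drupal/files:/var/www/html/sites/default/files \
  registry.example.com/drupal
```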
We use the second physical disk so we can then move the disk around to any other Atomic host and the service can be immediately available on the new host. We can rip out a damaged or compromised root partition and attach the persistent data disk to a fresh install within a few minutes. Effectively, the persistent data is completely divorced from the host.
The second physical disk can also be left out completely, and an NFS share (or other file store) mounted in its place, allowing for load-balancing and automatic scaling. The NFS share makes it possible to present the data to customers without giving them access to the host directly.
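In that variant, the /etc/fstab entry for the second disk is simply swapped for an NFS mount; the server and export path here are hypothetical:

```
# /etc/fstab – persistent data served over NFS instead of a local disk
nfs.example.com:/export/srv  /var/srv  nfs  defaults  0 0
```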
LVM for Change
No battle plan ever survives contact with the enemy.
– Helmuth von Moltke the Elder
As always happens, things change. What works now may not work in a year. The root filesystem and Docker image thin pools are created with LVM by Atomic, allowing us to expand them easily as necessary. The second physical disk is given its own volume group and logical volume, so it can also be expanded easily if we run out of space for persistent data. Every part of the Atomic host uses LVM – it’s a key to making the whole system extremely flexible.
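Growing the persistent-data volume onto a new disk is then a short online operation. A sketch, again using the hypothetical names `vg_data`/`lv_data`:

```shell
# Add a new disk to the data volume group and grow the LV into it.
pvcreate /dev/sdc
vgextend vg_data /dev/sdc
lvextend -l +100%FREE -r vg_data/lv_data   # -r also grows the XFS filesystem
```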
A Word of Caution
So far the system is relatively painless to use, with a single exception: measuring the data usage of the thin pool. It is important to keep track of how much free space is left in the thin pool for both the data and the metadata. According to Red Hat:
If the LVM thin pool runs out of space it will lead to a failure because the XFS file system underlying the LVM thin pool will be retrying indefinitely in response to any I/O errors.
You should be able to see the amount of space used by the thin pool with the `lvs` command. However, on the systems I’ve tried (both Atomic and standard RHEL7), the data columns are left blank:
I have not yet been able to figure out why this is the case. As a workaround, though, `docker info` can be used to gather the information. Note the “Data Space Used” and “Metadata Space Used” in the image below.
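Until the `lvs` output is sorted out, the same figures can be scraped from `docker info` on the command line:

```shell
# Pull the thin pool usage figures out of `docker info`.
docker info 2>/dev/null | grep -i 'space used'
```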