Back in February of this year, I wrote a short piece about the partition scheme I was considering for our RHEL Atomic Host systems at work. Six months on, I’ve got some real-world experience with the scheme – and it’s time to re-think some things.
Most of the assumptions from the last outing in this area remain valid – ie: Atomic hosts using a versioned filesystem tree (OSTree), the recommended practice of using direct-lvm storage for containers and images on production systems, and the need for persistent data. Really, all that has changed is our understanding of how large the storage needs to be and how it should be allocated, based on our usage in production for the last six months.
The biggest incorrect assumption from back in the golden days of blissful pre-production was the assertion that the root partition would only need 6G of storage by default. At the time, the Atomic hosts were shipping with a little less than 3G of packages and whatnot in the OSTree, so I reasoned that double that amount would be fine, allowing us to download a new tree and reboot into it. Since the majority of the filesystem is read-only and most of the activity was going to occur in the container Thin Pool storage or the persistent data partition, that’s all I thought we’d need.
That, it turns out, was a naive assumption. The OSTree used by the Atomic hosts is larger now, and that in and of itself would be enough to tip the scales, but we’ve also had problems with a lack of log space for host logs, and container files that aren’t necessarily stored in the Thin Pool (their own logs, for a start*).
* Note: The default logging service for containers in the latest versions of Atomic defaults to the host journald now, so individual log files are no longer an issue, but the point stands, since those logs still end up in the host journals.
I also assumed that the 4G LVM Thin Pool allocation was enough, since it could be expanded as needed. At the time, most of our containerized services were small, but we quickly started deploying larger services, and it seemed like the Ops guys were being paged every day to add disks to our thin pools.
The only thing I really got dead-on was the persistent storage. Our services VERY rarely need 15G, and they fit comfortably, but not overly spaciously, in that space. In the original scheme, though, I put this into its own Volume Group, which ended up making it less convenient to expand the storage. Being in its own VG prevents us from adding a single disk and expanding both persistent storage and the root or Thin Pool allocation. This led to a ridiculous number of relatively small virtual disks attached to each system.
Finally, a department-wide decision to increase the default storage size of all new virtual machines from 25G to 50G removed the need to justify using larger disks, and let me design a scheme that makes use of the new default size.
The Partition Scheme
That experience has led to our Partition Scheme v2, which makes better use of LVM and is less concerned with the physical disks:
50G in total
- /dev/sda (10G)
- /dev/sdb (40G)
One single Volume Group – “atomicos” (the default on RHEL Atomic hosts out of the box) – with three logical volumes:
- 15G atomicos-root ( / )
- 15G atomicos-srv ( /var/srv, for persistent data)
- Thin-provisioned atomicos-docker-pool (LVM Thin Pool)
I’m still using the Atomic vSphere image, and as before, the disk size within the image is 10G – where /dev/sda comes from. It’s easy enough to add a 40G additional disk, and use it to expand the default “atomicos” Volume Group to 50G.
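The arithmetic behind the layout can be sketched in a few lines of Python. This is only a rough budget: it assumes the Thin Pool can grow into whatever the two fixed-size volumes leave behind, and it ignores the boot partition and LVM metadata overhead, so the real figure will be somewhat smaller. The function name is made up for illustration.

```python
# Sizes in G, taken from the scheme above.
DISKS = {"/dev/sda": 10, "/dev/sdb": 40}
FIXED_VOLUMES = {"atomicos-root": 15, "atomicos-srv": 15}


def remaining_for_thin_pool(disks, fixed_volumes):
    """Return the space left in the VG after the fixed-size volumes.

    Assumption: no allowance is made for /boot, LVM metadata, or free
    space deliberately held back, so treat the result as an upper bound.
    """
    total = sum(disks.values())
    used = sum(fixed_volumes.values())
    if used > total:
        raise ValueError("fixed volumes exceed the volume group size")
    return total - used


print(remaining_for_thin_pool(DISKS, FIXED_VOLUMES))  # 20
```

In other words, roughly 20G is left for container images and layers, compared with the 4G allocation in the original scheme.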
Ansible supplants Docker-Storage-Setup
I initially used the docker-storage-setup tool to modify the size of the root partition and configure the Thin Pool. I was focused on using cloud-init for all of the host configuration, and this was the easiest method. Now, however, I’ve built out an Ansible infrastructure to do the initial configuration of our Atomic hosts, and use cloud-init only to pass in the SSH keys used to run Ansible. This ended up being much more convenient, as we could re-run the playbooks to update the hosts’ configurations as needed.
The out-of-the-box disk configuration for RHEL Atomic Host takes care of the thin pool setup, so we only need to add /dev/sdb to the VG, and create/expand/format the LVM partitions. This is easily accomplished with just a few lines of code:
## Disk Config
- name: expand vg with extra disks
  ## pvs takes a comma-separated list of devices
  lvg: vg=atomicos pvs=/dev/sda1,/dev/sdb

- name: expand the lvm
  lvol: vg=atomicos lv=root size=15g

- name: grow fs for root
  filesystem: fstype=xfs dev=/dev/mapper/atomicos-root resizefs=yes

- name: create srv lvm
  lvol: vg=atomicos lv=srv size=15g

- name: format fs for srv
  filesystem: fstype=xfs dev=/dev/mapper/atomicos-srv resizefs=no

- name: mount srv
  mount: name=/var/srv src=/dev/mapper/atomicos-srv fstype=xfs state=mounted opts='defaults'

## This is a workaround for XFS bug (only grows if mounted)
- name: grow fs for srv
  filesystem: fstype=xfs dev=/dev/mapper/atomicos-srv resizefs=yes
Back to the Future (and back again)
Plus ça change, plus c’est la même chose
– Jean-Baptiste Alphonse Karr
This is the plan for RHEL Atomic hosts for the near future. At the moment, our services are being deployed on individual, small-sized hosts and managed by directly talking to the remote Docker daemon’s API. We’re using an orchestration tool we developed in-house early on in our container journey.
However… Orchestration is King now. Containers are hard to work with as a human being, once you get to complexity or scale, and a variety of orchestration tools have come into their own in the last few years. And orchestration naturally lends itself to clustering. And clustering naturally lends itself to GINORMOUS servers running lots of services.
Everything old is new again, and it’s quite possible that in the near future we’ll be dealing with a few hundred clustered servers managed by some more standardized orchestration tool. In this case, it’s likely that a lot of this partitioning becomes less and less important, and the storage gets used more efficiently. We’d still make use of a small 15-ishG root partition, but the thin pools and persistent storage would be considerably larger.
Or, does that even work that way at scale? If the container images share layers, then at scale, each container’s images would be a much smaller fraction of the total. 100 containers sharing 90% of the layers could still fit into a small-ish size. Perhaps at scale, the thin pool would be only a dozen or so gigabytes larger in size.
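To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 1000 MB image size is purely a hypothetical figure for illustration, the function name is made up, and the model is deliberately naive: every image shares exactly the same set of layers, and the writable layer each running container adds is ignored.

```python
def pool_estimate_mb(image_count, image_mb, shared_percent):
    """Estimate thin pool usage when images share a common set of layers.

    Assumption: the shared layers are stored once, and each image adds
    only its own unique layers on top. Integer MB keeps the math exact.
    """
    shared_mb = image_mb * shared_percent // 100
    unique_mb = image_mb - shared_mb
    return shared_mb + image_count * unique_mb


# 100 images of 1000 MB each (hypothetical size), sharing 90% of their layers.
print(pool_estimate_mb(100, 1000, 90))  # 10900
# The same 100 images with nothing shared at all.
print(pool_estimate_mb(100, 1000, 0))   # 100000
```

Under these assumptions the shared case needs about 11G rather than 100G, which lines up with the "dozen or so gigabytes" guess.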
Persistent storage ends up becoming a more important matter, and less and less likely to exist on the host at scale. This would be the time to explore NFS mounts, or Ceph storage, and remove the persistence from the host entirely. And realistically, with Gluster or Ceph storage drivers for your container engine, even the Thin Pool may not be necessary. Are we looking at 25G storage attached to 100GB RAM systems managed by OpenShift/Kubernetes in our near future? It seems likely.
Like its predecessor, v2 of our partition scheme is likely to change.