Buildah: A new way to build container images

Project Atomic’s new tool, Buildah, facilitates new ways to build container images

A previous post covered a few different strategies for building container images. The first, building container images in place, is what everyone is familiar with from a traditional Docker build. The second strategy, injecting code into a pre-built image, allows developers to add their code to a pre-built environment without really messing with the setup itself. And finally, Asset Generation Pipelines use containers to compile assets that are then included during a subsequent image build, eventually implemented natively by Docker as Multi-Stage Builds. With the introduction of Project Atomic’s new Buildah tool for creating container images, it has become easier to implement a new build strategy that exists as a hybrid of the other three: using development tools installed elsewhere to build or compile code directly into an image.

Segregating build dependencies from production images

Buildah makes it easy to “expose” a working container to the build system, allowing tools on the build system to modify the container’s filesystem directly. The container can then be committed to a container image suitable for use with Docker, Runc, etc. This keeps the build tools from being installed in the image, resulting in a smaller, leaner image.

Using the ever-helpful GNU Hello as an example, consider the following Dockerfile:

FROM fedora:25
LABEL maintainer Chris Collins <[email protected]>
RUN dnf install -y tar gzip gcc make
RUN curl http://ftp.gnu.org/gnu/hello/hello-2.10.tar.gz | tar xvz -C /opt
WORKDIR /opt/hello-2.10
RUN ./configure
RUN make
RUN make install
ENTRYPOINT "/usr/local/bin/hello"

This is a relatively straightforward Dockerfile. Hello needs gcc and make to compile, and the container needs tar and gzip to extract the source tarball containing the code. None of these packages are required for Hello to work once it has been built, though. Nor does Hello need any of the dependency packages installed alongside these four (binutils, cpp, gc, glibc-devel, glibc-headers, guile, isl, kernel-headers, libatomic_ops, libgomp, libmpc, libstdc++, or libtool-ltdl), or the updates to glibc, glibc-common, glibc-langpack-en, libcrypt-nss, or libgcc. These packages add an extra 48M of data to the resulting image that isn’t needed to run GNU Hello. The extracted source files for Hello itself are another 3.7M.

With Buildah, an image can be built without any extra packages or source files making it into the final image.

#!/usr/bin/env bash
set -o errexit

# Create a container
container=$(buildah from fedora:25)

# Mount the container filesystem
mountpoint=$(buildah mount $container)

# A Buildah-native command to set the maintainer label
buildah config --label maintainer="Chris Collins <[email protected]>" $container

# Download & extract the source files to the host machine
curl http://ftp.gnu.org/gnu/hello/hello-2.10.tar.gz | tar xvz -C /tmp
pushd /tmp/hello-2.10

# Compile the code using make, gcc and their
# dependencies installed on the host machine
./configure
make

# Install Hello into the filesystem of the container
make install DESTDIR=${mountpoint}

popd

# Test that Hello works from inside the container filesystem
chroot $mountpoint bash -c "/usr/local/bin/hello -v"

# Set the entrypoint
buildah config --entrypoint "/usr/local/bin/hello" $container

# Save the container to an image
buildah commit --format docker $container hello

# Cleanup
buildah unmount $container
buildah rm $container

After using Buildah to create a container and mount its filesystem, the source files are extracted to the host.  Hello is compiled using development packages from the host, and then make install DESTDIR=${mountpoint} installs the resulting compiled software to the container’s filesystem. Hello can be run to validate that it works from within the container by using chroot to change to the root of the container before running.

In addition to basic shell commands, a couple of Buildah commands are used to add container-specific information to the working container: ​​​buildah config --label is used to add the “maintainer” label, and buildah config --entrypoint sets the entrypoint.

Finally buildah commit --format docker saves the container to a Docker compatible container image.

This is a simple example, but it gets the general idea across. Of course some software has not only build dependencies, but runtime dependencies, as well. For those use cases, packages can be installed directly into the container’s filesystem with the host’s package manager. For example: dnf install -y --installroot=${mountpoint}.
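
For instance, a runtime dependency could be installed from the host directly into the mounted container filesystem (a minimal sketch; sqlite here is just a stand-in for whatever the application actually needs):

# Install a runtime dependency into the container filesystem, not onto the host
dnf install -y --installroot=${mountpoint} --releasever=25 sqlite

# Clean up the package metadata inside the container filesystem to keep the image lean
dnf clean all --installroot=${mountpoint} --releasever=25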

Drawbacks to this method

Building images this way has some drawbacks, though. By removing the development tools from the image, the compilation of the software is no longer entirely contained in the image itself. The constant refrain of the container evangelists – “Build and Run Anywhere!” – is no longer true.1 When the devel tools are moved to the host, obviously, they must exist on the host. A stock Atomic host has no *-devel packages, so using the method above to build images that require these packages is not practical.2 The container images are no longer reliably reproducible.

A whole new world … er … container

These problems can be solved by using another container to build the image. Rather than installing development tools – or even Buildah – on the host, they can be built into a “builder” image that’s tailored to the type of image being created. For example, a builder image with make, gcc, and any other dependencies can be created to compile GNU Hello. Another image could include php and composer to compile assets for a PHP-based project. A Ruby builder image can be used for Ruby-on-Rails projects. This makes the build environment both portable and reproducible. Any project can contain not only its source code, but also code to create its build environment and production image.

Continuing with the GNU Hello example, a container image with Buildah, make, gcc, gzip, and tar pre-installed can be run, mounting the host’s /var/lib/containers directory and the buildah script from above:

docker run --volume /var/lib/containers:/var/lib/containers:z \
--volume $(pwd)/buildah-hello.sh:/build/buildah-hello.sh:z \
--rm \
--interactive \
--privileged \
--tty buildah-hello:latest /build/buildah-hello.sh

But there’s a catch, at least for now. As of August 2017, using Buildah in the container but not on the host creates an image that is difficult to interact with. The image is not available to the Docker daemon by default, because it’s in /var/lib/containers. Additionally, Buildah itself doesn’t yet support pushing to private registries that require authentication, so it’s challenging to get the image out of the container.

Skopeo, Buildah’s sister tool for moving images around, would be ideal for this. After all, that’s the Skopeo project’s …ahem… scope. Unfortunately, Buildah has a known issue that prevents Skopeo from pushing Buildah images to other locations, despite the fact that Skopeo can read and inspect the images.

There are some possible workarounds for now, though. First, if Buildah is installed on the host system, it will be able to read from /var/lib/containers (mounted into the container in the example above, allowing the resulting image to persist on the host), and the buildah push command from the host can copy the image to a local Docker daemon’s storage:

buildah push IMAGE:TAG docker-daemon:IMAGE:TAG

Optionally, if Docker is installed on the host system and in the build container, the host’s Docker socket can be mounted into the container, allowing Buildah to push to the host’s Docker daemon storage.
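
That variant might look something like this, re-using the build container from before but handing it the host’s Docker socket so the build script can finish with a push to the docker-daemon transport (a sketch, not a tested recipe):

docker run --volume /var/lib/containers:/var/lib/containers:z \
--volume /var/run/docker.sock:/var/run/docker.sock \
--volume $(pwd)/buildah-hello.sh:/build/buildah-hello.sh:z \
--rm \
--privileged \
buildah-hello:latest /build/buildah-hello.sh

# ...with the script's commit step followed by something like:
# buildah push hello docker-daemon:hello:latest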

Buildah builds three ways

So, Buildah can be used to interact directly with the container using tools on the host system, but Buildah also supports other ways of building images. Using buildah bud, or “build-using-dockerfile”, an image can be created as simply as with docker build. This method does not have the benefit of segregating development tools from the resulting production image; it does exactly what Docker would do. On the other hand, Buildah does not create and save intermediate images for each step, so builds are slightly to significantly faster with buildah bud than with docker build (depending on the number of external blockers, e.g. checking yum mirrors, waiting for code to compile, etc.).
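
In practice that is just a matter of pointing buildah bud at an existing Dockerfile – for example, the GNU Hello Dockerfile from earlier:

buildah bud -t hello .
# roughly equivalent to: docker build -t hello .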

Buildah also has its own native commands for interacting with a container, such as buildah run, buildah add, and buildah copy, each generally equivalent to their Docker counterparts. In the examples above, buildah config has been used to set container settings such as labels and the entrypoint. These native commands make it easy to build containers without a Dockerfile, using whatever tool works best for the job – bash, make, etc – but without the full complexity of modifying the container filesystem directly as in the examples above.
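
A tiny sketch of that style, using hypothetical names (an nginx package and a local nginx.conf file), might look like:

container=$(buildah from fedora:25)

# Equivalent of a Dockerfile RUN instruction
buildah run $container dnf install -y nginx

# Equivalent of a Dockerfile COPY instruction
buildah copy $container nginx.conf /etc/nginx/nginx.conf

# Metadata, as in the earlier examples
buildah config --entrypoint "/usr/sbin/nginx" $container

buildah commit --format docker $container my-nginx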

Buildah FTW

Buildah is a solid alternative to Docker for building container images, and, as shown, makes it easy to create a container image that includes only the code and packages needed for production. The resulting images are smaller, builds are quicker, and there is less surface area for attack should the container be compromised.3

Using Buildah inside a container with development tools installed adds another layer of portability, allowing images to be built on any host with Runc, and optionally Docker, installed. With this model, the “build anywhere” model of the Dockerfile is maintained while still segregating all the build tools from the resulting image.

Overall, Buildah is a great new way to build container images, and makes it easy to build images faster and leaner. With its build-using-dockerfile support, Buildah can serve as a drop-in replacement for the Docker daemon in build pipelines, and makes gradual migration to more sophisticated build practices less painful.


1: It’s not entirely true anyway, but by removing the build itself from inside the image, now it’s REALLY not true.

2: For the Atomic host example, you could take advantage of package layering to install the tools you need.

3: For whatever that buys you. It’s arguable that not including tools like make or gcc, etc, just adds a hurdle for an attacker but doesn’t actively make it any safer per se.

Header Image: By Pelf at en.wikipedia – Originally from en.wikipedia; description page is/was here., Public Domain, https://commons.wikimedia.org/w/index.php?curid=2747463

Ansible Role for RHEL Atomic Host

This morning I was asked by a friend if I could share any Ansible roles we use at $WORK for our Red Hat Atomic Host servers.  It was a relatively easy task to review and sanitize our configs – Atomic Hosts are so minimal, there’s almost nothing we have to do to configure them.

When our Atomic hosts are initially created, they’re minimally configured via cloud-init to set up networking and add a root user ssh key.  (We have a VMware environment, so we use the RHEL Atomic .ova provided by Red Hat, and mount an ISO with the cloud-init ‘user-data’ and ‘meta-data’ files to be read by cloud-init.)  Once that’s done, we run Ansible tasks from a central server to set up the rest of the Atomic host.
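
For reference, a seed ISO for cloud-init’s NoCloud datasource is commonly built with something like the following (a minimal sketch using genisoimage; the NoCloud datasource expects the volume label “cidata” and files named exactly user-data and meta-data – this isn’t necessarily the exact tooling we use):

# Build a cloud-init seed ISO from the user-data and meta-data files
genisoimage -output seed.iso -volid cidata -joliet -rock user-data meta-data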

Below is a snippet of most of the playbook.

I think the variables are self-explanatory.   Some notes are added to explain why we’re doing a particular thing.  The disk partitioning is explained in more detail in a previous post of mine.

---
  # Set w/Ansible because cloud-init is plain text
  - name: Access | set root password
    user: 
      name: root
      password: "{{ root_password }}"

  - name: Access | add ssh user keys
    authorized_key:
      user: "{{ item.name }}"
      key: "{{ item.key }}"
    with_items: "{{ ssh_users }}"

  - name: Access | root access to cron
    lineinfile:
      dest: /etc/security/access.conf
      line: "+:root:cron crond"

  - name: Access | fail closed
    lineinfile:
      dest: /etc/security/access.conf
      line: "-:ALL:ALL"

  # docker-storage-setup service re-configures LVM
  # EVERY TIME Docker service starts, eventually
  # filling up disk with millions of tiny files
  - name: Disks | disable lvm archives
    copy:
      src: lvm.conf
      dest: /etc/lvm/lvm.conf
    notify:
      - restart lvm2-lvmetad

  - name: Disks | expand vg with extra disks
    lvg:
      vg: '{{ volume_group }}'
      pvs: '{{ default_pvs }}'

  - name: Disks | expand the lvm
    lvol:
      vg: '{{ volume_group }}'
      lv: '{{ root_lv }}'
      size: 15g

  - name: Disks | grow fs for root
    filesystem:
      fstype: xfs
      dev: '{{ root_device }}'
      resizefs: yes

  - name: Disks | create srv lvm
    lvol:
      vg: '{{ volume_group }}'
      lv: '{{ srv_lv }}'
      size: 15g

  - name: Disks | format fs for srv
    filesystem:
      fstype: xfs
      dev: '{{ srv_device }}'
      resizefs: no

  - name: Disks | mount srv
    mount:
      name: '{{ srv_partition }}'
      src: '{{ srv_device }}'
      fstype: xfs
      state: mounted
      opts: 'defaults'

  ## This is a workaround for XFS bug (only grows if mounted)
  - name: Disks | grow fs for srv
    filesystem:
      fstype: xfs
      dev: '{{ srv_device }}'
      resizefs: yes

  ## Always check this, or it will try to do it each time
  - name: Disks | check if swap exists
    stat:
      path: '{{ swapfile }}'
      get_checksum: no
      get_md5: no
    register: swap

  - debug: var=swap.stat.exists

  - name: Disks | create swap lvm
   ## Shrink not supported until 2.2
   #lvol: vg=atomicos lv=swap size=2g shink=no
    lvol:
      vg: atomicos
      lv: swap
      size: 2g

  - name: Disks | make swap file
    command: mkswap '{{ swapfile }}'
    when:
      - swap.stat.exists == false

  - name: Disks | add swap to fstab
    lineinfile:
      dest: /etc/fstab
      regexp: "^{{ swapfile }}"
      line: "{{ swapfile }}  none    swap    sw    0   0"

  - name: Disks | swapon
    command: swapon '{{ swapfile }}'
    when: ansible_swaptotal_mb < 1   

  - name: Docker | setup docker-storage-setup
    lineinfile:
      dest: /etc/sysconfig/docker-storage-setup
      regexp: ^ROOT_SIZE=
      line: "ROOT_SIZE=15G"
    register: docker_storage_setup

  - name: Docker | setup docker-network
    lineinfile:
      dest: /etc/sysconfig/docker-network
      regexp: ^DOCKER_NETWORK_OPTIONS=
      line: >-
        DOCKER_NETWORK_OPTIONS='-H unix:///var/run/docker.sock
        -H tcp://0.0.0.0:2376
        --tlsverify
        --tlscacert=/etc/pki/tls/certs/ca.crt
        --tlscert=/etc/pki/tls/certs/host.crt
        --tlskey=/etc/pki/tls/private/host.key'

  - name: add CA certificate
    copy: 
      src: ca.crt 
      dest: /etc/pki/tls/certs/ca.crt 
      owner: root 
      group: root 
      mode: 0644

  - name: Admin Helpers | thinpool wiper script
    copy:
      src: wipe_docker_thinpool.sh
      dest: /usr/local/bin/wipe_docker_thinpool.sh
      mode: 0755

  - name: Journalctl | set journal sizes
    copy:
      src: journald.conf
      dest: /etc/systemd/journald.conf
      mode: 0644
    notify:
      - restart systemd-journald

  - name: Random Atomic Bugfixes | add lastlog
    file:
      path: /var/log/lastlog
      state: touch

  - name: Random Atomic Bugfixes | add root bashrc for prompt
    copy:
      src: root-bashrc
      dest: /root/.bashrc
      mode: 0644

  - name: Random Atomic Bugfixes | add root bash_profile for .bashrc
    copy:
      src: root-bash_profile
      dest: /root/.bash_profile
      mode: 0644

  ### Disable Cloud Init ###
  ## These are in Ansible 2.2, which we don't have yet
  - name: stop cloud-config
    systemd: name=cloud-config state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-init
    systemd: name=cloud-init state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-init-local
    systemd: name=cloud-init-local state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: stop cloud-final
    systemd: name=cloud-final state=stopped enabled=no masked=yes
    ignore_errors: yes

  - name: Remove old cloud-init files if they exist
    shell: rm -f /etc/init/cloud-*
    ignore_errors: yes

The only other tasks we run are related to $WORK specific stuff (security office scanning, patching user account for automated updates, etc).

One of the beneficial side-effects of mixing cloud-init and Ansible is that the cloud-init is only used for the initial setup (networking and root access), so it ends up being under the size limit imposed by Amazon Web Services on their user-data files.  This allows us to create and maintain RHEL Atomic hosts in AWS using the exact same cloud-init user-data file and Ansible roles.

Three Docker Build Strategies

There are any number of ways to use containers and numerous ways to build container images.  The creativity of the community never ceases to amaze me – I am always stumbling across a creative new use case or way of doing things.

As our organization at $WORK has adopted containers into our production workflow, I have tried many different permutations of image creation, but most recently I have distilled our process down to three main strategies, all of which coalesced around our use of continuous integration software.

Build In Place

Building in Place is what most people think of when talking about building container images.  In the case of Docker, the docker build command takes a Dockerfile and probably some supporting files and uses them to produce an image.  This is the basic way to produce an image, and the other two workflows below make use of it at some point, even if only inherited from a parent.

The main benefit of this process is that it is Simple. This is the process as documented on the Docker website.  Have some files.  Run docker build.  Voila!

It is also Transparent.  Everything that happens in the build is documented by the Dockerfile.  There are no surprises.  There are no outside actions acting on the build process that can change the result.*  You the human can see every step of the process laid out in the Dockerfile.

Finally, it is Self-Contained.  Everything needed for the build to succeed is present locally in the directory on your computer.  Give these files to someone else – in a tarball, or a git repo – and they too can build an identical image.

We use the Build in Place method to create our base images.  These builds contain all the sysadmin-y tasks that used to go into setting up a server prior to handing off to a developer to deploy their code: software updates, webserver installation and generic setup, etc.  The images are all generic, and with very few exceptions, no real service we use is created from a Build In Place process.

* Unless you have a RUN command that curls a *.sh file from the web somewhere and pipes it to bash.  But in that case you are really just asking for trouble anyway.  And shame on you.

Inject Code

The Inject Code method of building a container image is the most used in our organization.  In this method, a pre-built parent image is created as the result of a Build In Place process.  This image has several ONBUILD instructions in its Dockerfile, so when a child image is created, those steps are executed first.  This allows our CI system to create an empty Dockerfile with the parent image in the FROM instruction, clone a git repo with the developer’s code, and run docker build.  The ONBUILD instructions inject the code into the image and run the setup, and we end up with an application-specific container image.

For example, our Ruby on Rails parent image includes instructions such as:

ONBUILD ADD . $APPDIR
ONBUILD RUN bash -x /pick-ruby-version.sh
ONBUILD WORKDIR $APPDIR
ONBUILD RUN gem install bundler \
            && rbenv rehash \
            && bundle install --binstubs /bundle/bin \
                              --path /bundle \
                              --without development test \
                              --deployment \
            && RAILS_ENV=production RAILS_GROUPS=assets \
               bundle exec rake assets:precompile

The major benefit of this build workflow is that it Removes System Administration Tasks from Developers.  The sysadmins build and maintain the parent image, and developers can just worry about their code.

The workflow is also relatively Simple for both the sysadmins and the developers.  Sysadmins effectively use the Build In Place method, and developers don’t actually have to do any builds at all, just commit their code to a repo, triggering the CI build process.

The CI process is effectively just the following two lines (plus tests):

echo "FROM $PARENT_IMAGE" > Dockerfile
docker build -t $CHILD_IMAGE .

The simplicity and hands-off approach of this process is effectively Made for Automation.  With a bit of automation around deploying a container from the resulting image, a developer can create a new app, push it to a git repo and tell the orchestration tool about it, and a new service is created and deployed without any other human involvement.

Unlike the Build In Place process (for which I couldn’t come up with a single real negative), Inject Code has a few gotchas.

The process can be somewhat Opaque.  Developers don’t get a clear view of what exactly is in the parent image or what the build process is going to do with their code when the ONBUILD instructions run, requiring either meticulous documentation by the sysadmins (ha!) (Edit: I was rightly called out for this statement – see below*), tracking down and examining the Dockerfiles for all the upstream images, or inspecting them with the docker history and docker inspect commands.
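
For the inspection route, the ONBUILD triggers and layer history of a parent image can be pulled out directly (the image name here is just a placeholder):

# List the ONBUILD triggers baked into a parent image
docker inspect --format '{{ .Config.OnBuild }}' rails-parent:latest

# Show how each layer of the parent image was built
docker history rails-parent:latest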

The build process itself ends up being opaque in practice.  By making it simple and one-step, the tendency is for developers to never look at it, and when the build fails they turn to the sysadmins to figure out what went wrong.  This is really a cultural byproduct of the process, so it might not be an issue everywhere, but it’s what has happened for us.

The Inject Code process also makes it a bit tougher to customize an image for an application.  We have to ship the parent image with multiple copies of ruby, and allow developers to specify which is used with an environment file in the root of their code.  Extra OS packages are handled the same way (think: non-standard libraries).  These end up being handled during the ONBUILD steps, but it’s not ideal.  At some point, if an application needs too much specialization, it’s just easier to go back to the Build In Place method.

* A friend of mine read this after I posted and called me out on the statement here.  I was being a poor team member by not working with the sysadmins to help solve the problem, explaining the necessity, or at the very least understanding where their frustrations lie.  I appreciate the comment, and am glad that my attention was called to it.  It’s too easy to be frustrated and nurture a grudge when in fact the right thing to do is to work together to come to a solution that satisfies both parties.  The former just serves as “wall building” and reinforces silos and poor work culture.

Asset Generation Pipeline

Our final method of generating container images is the Asset Generation Pipeline.  This is a complicated build process that utilizes builder containers to process code or other input in order to generate the assets that go into building a final image.  This can be as simple as building an RPM package from source and dropping it onto the filesystem to be included in the docker build, or as complicated as a multi-container process that compiles code, ingests and manipulates data, and prepares it for the final image (mostly used by researchers).

Some of our developers are using this method to manage Drupal sites, checking out their code from a git repo, and running a builder container on it to compile sass and run composer tasks to prepare the site for actual production, and then including the actual public-facing code in a child image.

The biggest benefit of this process (to me at least) is Minimal Image Size. We can use this process to create final images without having to include any of the build tools that went into creating it.

For example, I use this process to create RPM packages that can then be added to the child image and installed without a) having to do the RPM build or compile from source in the child image build process, or b) including any of the dev tools, libraries, compilers, etc. that are needed to create the RPMs.  Our Drupal developers, as mentioned above, can include only the codebase for the production site itself, and none of the meta information or tools needed to produce it.
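
The RPM flavor of that pipeline looks roughly like this (a sketch only – the builder image, script, and package names are hypothetical placeholders):

# Run a throwaway builder container that drops the finished RPM into ./artifacts
docker run --rm \
    --volume $(pwd)/artifacts:/output:z \
    rpm-builder:latest build-rpm.sh

# The final Dockerfile then only needs the artifact, not the toolchain:
#   FROM centos:centos7
#   ADD artifacts/myapp.rpm /tmp/myapp.rpm
#   RUN yum install -y /tmp/myapp.rpm && rm -f /tmp/myapp.rpm
docker build -t myapp:latest .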

This process also Reduces Container Startup Time by negating the need to do initialization or asset compilation at run time.  By pre-compiling and adding the results to the child image, the containers can start immediately on docker run.  Given the time required for some of these processes, this is a big plus for us.  Fast startup is good for transparency to end-users, quick auto-scaling under load, and reduced service degradation time.

Finally a big benefit of this process is that it can Create Complex Images from Basic Parent Images.   Stringing multiple builder containers along the pipeline allows each container to be created from a simple, single-task parent image.  Each image is minimal and each container has a single, simple job to do, but the end result can be a very complex final image.

Drawbacks of the Asset Generation Pipeline process are fairly obvious.  First off, it’s fairly Complicated.  The CI jobs that produce the final images are long, and usually time-consuming.  They require a lot of images, and create a lot of containers.  We have to be careful to do efficient garbage collection – nothing is worse than being paged in the middle of the night because a build host ran out of disk space.

They are also More Prone to Failure.  As any good engineer knows, more parts means more points of failure.  The longer the chain, the more things that can go wrong and spoil a build.  This also necessitates better (and more) tests.  Having a half dozen containers prepare your code base means it could be wrong in a half dozen different ways if your tests aren’t good.

Finally, from a technical perspective, using a pipeline that generates output makes it Difficult to Build Remotely.  Our CI system relies on a Jenkins or GitLab CI host, which connects to remote Red Hat Atomic servers to run the docker build command.  This works by cloning repositories locally to the CI host, and sending the build context to the Atomic host.  Unfortunately, generated assets are left on the Atomic host, not in the build context that lives on the CI server.  This necessitates some workarounds to get the assets back into the build context, or in some cases, different build processes that skip the centralized CI servers in favor of custom local builds.

So those are the three primary ways we are building images in production at $WORK.  There are tons of different and creative ways to create images, but these have proven to work for the use cases we have.  That’s not to say there aren’t other legitimate cases, but it’s what we need at the moment, and it works well.  I’d be interested to hear how others are doing their builds.  Do they fit in one of these patterns?  Is it something more unique and cool?  There’s always so much to learn!

[UPDATE] Disk partition scheme for RHEL Atomic Hosts

Back in February of this year, I wrote a short piece about the partition scheme I was considering for our RHEL Atomic Host systems at work.  Six months on, I’ve got some real-world experience with the scheme – and it’s time to re-think some things.

Most of the assumptions remain valid from the last outing in this area – i.e. Atomic hosts use a versioned filesystem tree (OSTree), the recommended practice is to use direct-lvm storage for containers and images on production systems, and there is still a need for persistent data.  Really, all that has changed is a better understanding of the size of the storage and how it should be allocated, based on our usage in production for the last six months.

The biggest incorrect assumption from back in the golden days of blissful pre-production was the assertion that the root partition would only need 6G of storage by default.  At the time, the Atomic hosts were shipping with a little less than 3G of packages and whatnot in the OSTree, so I reasoned that double that amount would be fine, allowing us to download a new tree and reboot into it.  Since the majority of the filesystem is read-only and most of the activity was going to occur in the container Thin Pool storage or the persistent data partition, that’s all I thought we’d need.

That, it turns out, was a naive assumption.  The OSTree used by the Atomic hosts is larger now, and that in and of itself would be enough to tip the scales, but we’ve also had problems with a lack of log space for host logs, and container files that aren’t necessarily stored in the Thin Pool (their own logs, for a start*).

* Note: The default logging service for containers in the latest versions of Atomic defaults to the host journald now, so individual container logs are no longer an issue, but the point stands – they’re just logged to the host journals instead.

I also assumed that the 4G LVM Thin Pool allocation was enough, since it could be expanded as needed.  At the time, most of our containerized services were small, but we quickly started deploying larger services, and it seemed like the Ops guys were being paged every day to add disks to our thin pools.

The only thing I really got dead-on was the persistent storage.  Our services VERY rarely need 15G, and they fit comfortably, but not overly spaciously, in that space.  In the original scheme, though, I put this into its own Volume Group, which ended up making it less convenient to expand the storage.  Being in its own VG prevents us from adding a single disk and expanding both persistent storage and the root or Thin Pool allocation.  This led to a ridiculous number of relatively small virtual disks attached to each system.

Finally, a wider, department-wide decision to increase the default storage size of all new virtual machines from 25G to 50G removed the need to justify using larger disks, and let me design a scheme that makes use of the new default size.

The Partition Scheme

That experience has led to our Partition Scheme v2, which makes better use of LVM and is less concerned with the physical disks:

Physical Disks

50G in total

  • /dev/sda (10G)
  • /dev/sdb (40G)

Partitioning

One single Volume Group – “atomicos” (the default on RHEL Atomic hosts out of the box) – with three logical volumes:

  • 15G atomicos-root ( / )
  • 15G atomicos-srv ( /var/srv, for persistent data)
  • Thin-provisioned atomicos-docker-pool (LVM Thin Pool)

I’m still using the Atomic vSphere image, and as before, the disk size within the image is 10G – which is where /dev/sda comes from.  It’s easy enough to add an additional 40G disk, and use it to expand the default “atomicos” Volume Group to 50G.

The Method

Ansible supplants Docker-Storage-Setup

I initially used the docker-storage-setup tool to modify the size of the root partition and configure the Thin Pool.  I was focused on using cloud-init for all of the host configuration, and this was the easiest method.  Now, however, I’ve built out an Ansible infrastructure to do the initial configuration of our Atomic hosts, and use cloud-init only to pass in the SSH keys used to run Ansible.  This ended up being much more convenient, as we could re-run the playbooks to update the hosts’ configurations as needed.

The out-of-the-box disk configuration for RHEL Atomic Host takes care of the thin pool setup, so we only need to add /dev/sdb to the VG, and create/expand/format the LVM partitions.  This is easily accomplished with just a few lines of code:

  ## Disk Config
 - name: expand vg with extra disks
   lvg: vg=atomicos pvs=/dev/sda1,/dev/sdb

 - name: expand the lvm
   lvol: vg=atomicos lv=root size=15g

 - name: grow fs for root
   filesystem: fstype=xfs dev=/dev/mapper/atomicos-root resizefs=yes

 - name: create srv lvm
   lvol: vg=atomicos lv=srv size=15g

 - name: format fs for srv
   filesystem: fstype=xfs dev=/dev/mapper/atomicos-srv resizefs=no

 - name: mount srv
   mount: name=/var/srv src=/dev/mapper/atomicos-srv fstype=xfs state=mounted opts='defaults'

 ## This is a workaround for XFS bug (only grows if mounted)
 - name: grow fs for srv
   filesystem: fstype=xfs dev=/dev/mapper/atomicos-srv resizefs=yes

Back to the Future (and back again)

Plus ça change, plus c’est la même chose
 Jean-Baptiste Alphonse Karr

This is the plan for RHEL Atomic hosts for the near future.  At the moment, our services are being deployed on individual, small-sized hosts and managed by directly talking to the remote Docker daemon’s API.  We’re using an orchestration tool we developed in-house early on in our container journey.

However… Orchestration is King now.  Containers are hard to work with as a human being, once you get to complexity or scale, and a variety of orchestration tools have come into their own in the last few years.  And orchestration naturally lends itself to clustering.  And clustering naturally lends itself to GINORMOUS servers running lots of services.

Everything old is new again, and it’s quite possible that in the near future we’ll be dealing with a few hundred clustered servers managed by some more standardized orchestration tool. In that case, it’s likely that a lot of this partitioning becomes less and less important, and more efficient.  We’d still make use of a small 15-ish G root partition, but the thin pools and persistent storage would be considerably larger.

Or, does that even work that way at scale?  If the container images share layers, then at scale, each container’s images would be a much smaller fraction of the total.  100 containers sharing 90% of the layers could still fit into a small-ish size.  Perhaps at scale, the thin pool would be only a dozen or so gigabytes larger in size.

Persistent storage ends up becoming a more important matter, and less and less likely to exist on the host at scale.  This would be the time to explore NFS mounts, or Ceph storage, and remove the persistence from the host entirely.  And realistically, with Gluster or Ceph storage drivers for your container engine, even the Thin Pool may not be necessary.  Are we looking at 25G storage attached to 100GB RAM systems managed by OpenShift/Kubernetes in our near future?  It seems likely.

Like its predecessor, v2 of our partition scheme is likely to change.

Disk partition scheme for RHEL Atomic Hosts

I’ve been working on what will likely be the production disk partition system for our RHEL Atomic Host systems at work.  There’s a bit of a balancing act to this setup, with three things to take into consideration.

First, since these are Atomic hosts, most of the system is made up of a versioned filesystem tree (OSTree).  The OSTree manages all the packages on the system and so there is not much need to mess with the root partition.  It does not take up much space by default – about 1.6 G with the current and last OSTree.

Second, Atomic hosts are designed to run Docker containers.  Docker recommends using direct-lvm on production systems.  An LVM thin pool is created directly on block devices and used to store the image layers.  Each layer is a snapshot created from its parent image, and container layers are snapshots of their parent images as well.  Some free space is needed with which to create this thin pool.

Finally, for many services hosted in containers, there has to be a way to store persistent data.  What is considered persistent data varies by the type of service.  Consider, for example, user-uploaded content for a Drupal website, or custom configuration files telling a proxy server how it works, or database data files.  This persistent data needs to live somewhere.

The Partition Scheme

Given all this, it seems the best partition scheme for our use is the following:

/dev/sda:

  • /dev/sda1 – / (6G)
  • LVM Thin Pool – /var/lib/docker (4G †)

/dev/sdb‡:

  • /dev/sdb1 – /var/srv (symlinked to /srv in Atomic, 15G †)

† sizes of these disks could be expanded as needed
‡ /dev/sdb could be replaced with an NFS mount at /var/srv

Our environment is based on the Atomic vSphere image, and new Atomic hosts are created from this image.  The disk size within the image is 10G, which is where the size of /dev/sda comes from.  This could be expanded using vmkfstools before the VM is powered on, if needed.  In practice, however, 10G covers a lot of the minor services that are deployed, and if more space is needed, the LVM pool can be expanded onto another disk while the system is online, providing more space for images.
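
Expanding the pool onto an extra disk is just a matter of LVM (a sketch, assuming a new /dev/sdc and the default atomicos/docker-pool names):

# Add the new disk to the volume group
pvcreate /dev/sdc
vgextend atomicos /dev/sdc

# Grow the Docker thin pool into the new free space
lvextend -l +100%FREE atomicos/docker-pool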

The default size of the root partition in Atomic is 3G.  With two OSTrees installed, almost half of that is used up.  It’s useful to expand this to provide some headroom to store the last tree and some logs and incidental data.

Docker-Storage-Setup

Luckily a helper tool, docker-storage-setup, is included in the docker rpm to not only expand the root partition, but also set up the thin pool and configure Docker to use direct-lvm. Docker-storage-setup is a service that runs prior to the Docker service.  To expand the root size to 6G, add the following to /etc/sysconfig/docker-storage-setup.

# /etc/sysconfig/docker-storage-setup
ROOT_SIZE=6G

This file is read by docker-storage-setup each time it runs.  It can be used to specify the default root size, which block devices or volume groups are to be included in the thin pool, how much space is reserved for data and metadata in the thin pool, etc.

(More information about these options can be found in /usr/bin/docker-storage-setup.)

By only setting ROOT_SIZE, docker-storage-setup is allowed to expand the root partition to 6G, and use the rest of /dev/sda for the thin pool.
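
For reference, a more explicit configuration might look something like this (the option names come from docker-storage-setup itself; the values here are made up for illustration):

# /etc/sysconfig/docker-storage-setup
ROOT_SIZE=6G
DEVS=/dev/sdc        # extra block device(s) to add to the thin pool
VG=atomicos          # volume group the pool lives in
DATA_SIZE=60%FREE    # share of free space given to the thin pool data LV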

Persistent Data

Persistent data is special.  It is arguably the only important data on the entire host.  The host itself is completely throw-away;  a new one can be spun up, configured and put into service in less than 10 minutes.  They are designed for nothing more in life than hosting containers.

Images and containers are similarly unimportant.  New images can be pulled quickly from a registry in minutes or seconds, and they contain immutable data in any case.

Containers could be considered more important, but if their ephemeral nature is preserved – i.e. nothing important goes into a container, and all persistent data is mounted in or stored elsewhere – then they, too, are truly unimportant.

So the persistent data lives on another physical disk, and is mounted as a volume into the Docker containers.  It could go somewhere in the root partition, but since the root partition is managed by the OSTree, it’s essentially generic and disposable.  By mounting a dedicated disk for persistent data, we can treat it separately from the rest of the system.

We use the second physical disk so we can then move the disk around to any other Atomic host and the service can be immediately available on the new host.  We can rip out a damaged or compromised root partition and attach the persistent data disk to a fresh install within a few minutes.  Effectively, the persistent data is completely divorced from the host.

The second physical disk can also be left out completely, and an NFS share (or other file store) mounted in its place, allowing for load-balancing and automatic scaling.  The NFS share makes it possible to present the data to customers without giving them access to the host directly.

LVM for Change

No battle plan ever survives contact with the enemy.
Helmuth von Moltke the Elder

As always happens, things change.  What works now may not work in a year.  The root filesystem and Docker image thin pools are created with LVM by Atomic, allowing us to expand them easily as necessary.  The second physical disk is given its own volume group and logical volume, to allow it to also be expanded easily if we run out of space for persistent data.  Every part of the Atomic host uses LVM – it’s a key to making the whole system extremely flexible.

A Word of Caution

So far the system is relatively painless to use, with a single exception: measuring the data usage of the thin pool.  It is important to keep track of how much free space is left in the thin pool for both the data and the metadata.  According to Red Hat:

If the LVM thin pool runs out of space it will lead to a failure because the XFS file system underlying the LVM thin pool will be retrying indefinitely in response to any I/O errors.

You should be able to see the amount of space used by the thin pool with the `lvs` command.  However, with the systems I’ve tried (both Atomic and standard RHEL7), the data is left blank:

[Screenshot: lvs output with the thin pool data and metadata usage columns blank]

I have not yet been able to figure out why this is the case. As a workaround, though, `docker info` can be used to gather the information.  Note the “Data Space Used” and “Metadata Space Used” in the image below.

[Screenshot: docker info output showing “Data Space Used” and “Metadata Space Used”]
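
On the command line, the same numbers can be pulled out with a quick grep (a one-liner sketch):

docker info 2>/dev/null | grep -E 'Space Used|Space Total|Space Available'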

Quick Tip – Docker ENV variables

It took me a little while to notice what was happening here, so I’m writing it down in case someone else needs it.

Consider this example Dockerfile:

FROM centos:centos7
MAINTAINER Chris Collins

ENV VAR1="foo"
ENV VAR2="bar"

It’s common practice to collapse the ENV lines into a single line, to save a layer:

FROM centos:centos7
MAINTAINER Chris Collins

ENV VAR1="foo" \
    VAR2="bar"

And after building an image from either of these Dockerfiles, the variables are available inside the container:

[[email protected] envtest]$ docker run -it envtest bash
[[email protected] /]# echo $VAR1
foo
[[email protected] /]# echo $VAR2
bar

I’ve also tried to use ENV vars to create other variables, like you can do with bash:

FROM centos:centos7
MAINTAINER Chris Collins

ENV VAR1="foo" \
    VAR2="Var 1 was set to: ${VAR1}"

This doesn’t work, though.  I assume $VAR1 is not set yet when Docker builds the layer, so it cannot be used in $VAR2.

[[email protected] envtest]$ docker run -it envtest bash
[[email protected] /]# echo $VAR1
foo
[[email protected] /]# echo $VAR2
Var 1 was set to:

Using a single line for each ENV does work, though, as the previous layer has been parsed and added to the environment.

FROM centos:centos7
MAINTAINER Chris Collins
ENV VAR1="foo" 
ENV VAR2="Var 1 was set to: ${VAR1}"

[[email protected] envtest]$ docker run -it envtest bash
[[email protected] /]# echo $VAR1
foo
[[email protected] /]# echo $VAR2
Var 1 was set to: foo

So, while it makes sense to try to collapse ENV lines, to save layers**, there are definitely cases where you’d want to separate them.  I am using this in a Ruby-on-Rails image:

[...]
ENV RUBYPKGS='ruby2.1 mod_passenger rubygem-passenger ruby-devel mysql-devel libxml2-devel libxslt-devel gcc gcc-c++' \
    PATH="/opt/ruby-2.1/bin:$PATH" \
    NOKOGIRI_USE_SYSTEM_LIBRARIES='1' \
    HTTPDMPM='prefork'

ENV APPENV='test' \
    APPDIR='/var/www/current' \
    LOGDIR='/var/log/rails'

ENV RAILS_ENV="${APPENV}" \
    RACK_ENV="${APPENV}"
[...]

A logical separation of sections is helpful here – the first ENV is for system stuff, the second for generic application setup on the host, and the third to set the application environments themselves.

**I have heard rumblings that in future versions of Docker, the ENV stuff will not be a layer – more like metadata, I think.  If that is the case, the need to collapse the lines will be obsoleted.

Using Docker and AWS to Survive an Outage

Last week at $WORK, we suffered from an outage that slowed down a large part of our network and took down our main website for both internal and external customers.  We were under a distributed denial of service attack focused on the website itself.  The site is load-balanced, and this resulted in slowdowns or outages for all the services behind the load balancers, as well.

While folks were bouncing ideas around on how to bring the site up again while still struggling with the outage, I mentioned that I could pretty quickly migrate the site over to Amazon Web Services and run it in Docker containers there. The higher-ups gave me the go-ahead and a credit card (very important, heh) and told me to get it setup.  The idea was to have it there so we could fail over to the cloud if we were unable to resolve the outage in a reasonable time.

TL;DR – I did, it was easy, and we failed over all external traffic to the cloud. Details below.

Amazon Web Services

Despite having a credit card and a pretty high blanket “OK”, I wanted to make sure we didn’t spend any money unless it was absolutely necessary. To that end, I created three of the “free tier” EC2 instances (1GB RAM, 1 CPU, 10GB Storage) rather than one or more larger instances. After all, these servers were going to be doing one thing and one thing only – running Docker. I took all the defaults, except two. First, I opted to use RHEL7 as the OS. We use Red Hat at work, so I’m familiar with it (and let’s be honest, it works really well), especially where setting up Docker comes in. Second, I set up a security group that allowed only HTTP/HTTPS traffic to the EC2 instances, and SSH access only from $WORK. Security groups are like a logical firewall, I guess – run by Amazon in front of the servers themselves.

The EC2 instances started almost immediately, and I logged in via SSH using the key pair I created for this project. The first thing I did was augment the security group by setting the IPTables firewall on the hosts themselves to match: SSH from $WORK only, drop everything else, even pings.  You know, just in case.
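
For the record, those host rules boiled down to something like this (illustrative only – 203.0.113.0/24 stands in for $WORK’s actual network range):

# Keep established connections and loopback traffic
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT

# SSH only from $WORK (placeholder CIDR), then drop everything else, even pings
iptables -A INPUT -p tcp --dport 22 -s 203.0.113.0/24 -j ACCEPT
iptables -P INPUT DROP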

Note: Since I was planning to use Docker to run the website, I didn’t need to add IPTables rules for HTTP/HTTPS. Docker uses the FORWARD chain, since it NATs from the host IP to the containers, and Docker has the ability to add and remove rules from the chain itself as needed.

Next, I ran a quick *yum update* to get the latest patches on the EC2 instance. It wasn’t terribly out of date, so this was quick.

Now to the meat of things. I didn’t really want to muck about with repos or try to find which one was required to install the Docker RPM, so I just copied the RPM for Docker from our local repository. The RPM is packaged upstream by Red Hat, and includes Docker 1.2.1. Even though I wanted to use Docker 1.4.1, the older RPM version is no big deal – I just installed it to get the basic config files – systemd service files, sysconfig, etc. Once the RPM was installed, I downloaded the Docker 1.4.1 binary from Docker.io, and replaced the 1.2.1 binary from the RPM. Presto! Latest Docker with the handy *docker exec* command! At this point, the server itself was basically done, and I moved on to setting up the Docker image.

Time spent so far: About 5 minutes

Docker

Now, I didn’t have an image for our website ready to go or anything – I was going to have to build it from scratch.  However, I’ve been lucky enough to be allowed to play around with Docker at $WORK, and had already done some generic Images for web stacks for our public DockerDemos project (https://dockerdemos.github.io/appstack/), so I was familiar with what I’d need to build the image for our site. I wrote a Dockerfile and built the image on my local laptop to test it. I went through a few revisions to get it perfect, but it only took about 15 minutes to write it from scratch. Once that was ready, I copied the Dockerfile and supporting files up to the EC2 servers, and built the images there. With the magic that is Docker and Linux containers, everything functioned exactly as it did on my laptop, and in a few seconds all three EC2 instances had the website image ready to go.

The final step was to run the container from the image. On all three of the EC2 instances, I ran:

docker run --name website -p 80:80 -p 443:443 -d website && \
docker logs -f website

The first command immediately started up the web servers inside the containers and started to sync their content, and the second opened up STDOUT inside the container so I could watch the progress. In a minute or two the sync was done, and the servers were online!

Note: The “sync” I’m talking about is part of how our website works, not something related to Docker itself.

Total time spent: About 25 minutes

So, in one fell swoop – about a half hour – I was able to create three servers running a Docker image to serve our main website, from scratch. It’s a good thing, too. It wasn’t long before we made the call to fail over, and currently all of our external traffic to the site is being served by these three containers.

That seems cool, no?  But check this out:

Sunday night, I needed to add more servers to the rotation. It was late. I was cranky to have been called after hours. I logged into AWS and used the EC2 “Create Image” feature to commit one of the running instances to a custom image (took about a minute). Then, I spun up three more EC2 instances from that image. They started up as quickly as a normal EC2 instance, and contained all the work I’d already done to set up the first servers, including the Docker package, binary, and image. Once they were up, all I had to do was run the *docker run* command again, and they were ready to go. Elapsed time?

2 minutes

It took longer for the 5 minute time-to-live on our DNS entry to expire.

Docker is Awesome. With AWS, it’s Awesome-er. I’m trying to convince folks that we should leave all of our external traffic to be served by Docker in AWS, and to migrate more sites out there. At the very least, it’s extremely flexible and allows us to respond to issues on a whole different timescale than we could before.

Oh, and an added bonus? All of our external monitoring (in multiple sites across the country) report that our page load speeds have improved 3x compared to what they were on the servers hosted in-house with regular non-Docker setups. I’m investigating what is giving us that increase this week.

Oh, and a second added bonus? For the last five days, our bill from Amazon for hosting our main website is a whopping *$4.69*. That’s a cup of crappy venti mochachino soy caramel crumble arabian dark coffee (or whatever) at the local coffee chain. And I can do without the calories.

Update:

Well, it’s been six months since this little adventure took place.  Since then, this solution has worked so well that we left all of our external traffic pointing to these instances at AWS.  Arguably, things have gotten even easier the more I work with both AWS and Docker.  To that point:

  1. The Docker approach worked so well that we replaced all of the servers hosting our website internally with basic RHEL7 servers running Docker containers.  The servers are considerably more lightweight than they used to be, and as such we can get better performance out of them.
  2. I’ve since added the new(-ish) Docker flag --restart=always to the deploy command.  This saves me the step of even having to start the containers on reboot.
  3. I set up all the hosts to use the Docker API, and TLS authentication, so I can upload new images and start and stop containers on each host without even needing to log in to them.  (This required the opening of port 2376 to $WORK in the security group and host firewall, fyi.)  A sketch of what that looks like follows this list.
  4. I wrote a couple of simple bash scripts to re-build the image as needed, and deploy locally for testing.  With the portable nature of Docker images, it’s extremely easy for me to test all changes before I push them out.
  5. Rotating the instances in and out of production at AWS is extremely simple with Amazon’s Elastic IP Addresses.  We are able to rotate a host out of service and instantly replace it with another, allowing us to patch them all with zero downtime.
  6. Amazon’s API is a wonderful thing.  I can manage the entire thing with some python scripts on my laptop, or the convenient Amazon CLI package.
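
That remote-API workflow looks roughly like this (a sketch – the host name and certificate paths are placeholders):

# Talk to a remote host's Docker daemon over TLS
docker --tlsverify \
       --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem \
       -H tcp://docker-host.example.com:2376 ps

# Ship a locally built image to that host over the same connection
docker save website | \
    docker --tlsverify \
           --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem \
           -H tcp://docker-host.example.com:2376 load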

Docker and AWS have proven themselves to me, and, through this process, to the higher-ups at $WORK.  We’re embracing Docker whole-heartedly in our datacenters here, and we’ve moved a number of services to AWS now, as well.  The ease and flexibility of both is a boon to us, and to our clients, and it’s starting to transform the way we do things in IT – the way we do everything in IT.