Have you ever felt that running docker build is slow? Looking at the logs, it seems like Docker is trying to download your dependencies over and over again. Can't Docker intelligently reuse those?
This is a series on understanding why this problem occurs and how we can improve the situation. In this first post of the series, we will examine the file system that Docker containers are using.
Docker containers are presented with a type of file system known as a union mount file system. There are many implementations of such file systems, such as aufs, overlay, and the recommended overlay2.
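We can think of a union mount file system as a stack of transparent trays, where each tray contains some files. The rules of the file system are simple:
(1) Looking down through the stack from the top, we see the top-most copy of each file.
(2) We can only add or change files in the top-most tray.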
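A quick example. Suppose we start from an empty file system. First, we add a new tray to it; call it layer 0.
Next, we add a file to it. Let's write main.js, so that layer 0 now contains main.js.
Then, we add another tray on top of it. This is layer 1, and it starts out empty.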
Now, let's edit (i.e., update the body of) main.js. Notice that main.js is in layer 0, which is not the top-most layer. By rule (2), we can only write to layer 1, so the edited copy goes there. Hence, after the update, the file system will look like:
    layer 1: main.js (edited)
    layer 0: main.js (original)
Let's add another layer, layer 2.
We can perform more than one operation in a single layer. Let's create package.json and delete main.js. Creating a file is easy: we write package.json to layer 2.
But how can we delete a file? main.js exists in layers 0 and 1. By rule (2) again, we can't touch them. Instead, most union mount file systems support a special whiteout file to indicate that a file is deleted. We will denote a whiteout by a ~ prefix:
    layer 2: package.json, ~main.js
    layer 1: main.js (edited)
    layer 0: main.js (original)
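We can also perform multiple operations on a file in a single layer. First, let's create server.js with some content in it on a new layer, layer 3.
Then we edit server.js by changing its content. Since server.js already lives in the top-most layer, the edit happens in place and no extra copy is needed.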
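Finally, the logical state of this file system contains just two files: package.json and server.js. The whiteout in layer 2 hides both copies of main.js, even though they still exist in layers 0 and 1.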
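To make the walkthrough concrete, here is a small Python sketch that replays it. This is a toy model for illustration only, not how overlay2 or any real driver is implemented: the UnionStack class, the WHITEOUT sentinel, and the file contents ("v1", "v2", and so on) are all hypothetical names made up for this example.

```python
# A toy model of a union mount file system: a stack of "trays" (layers).
# WHITEOUT is a hypothetical sentinel standing in for a whiteout file.
WHITEOUT = object()


class UnionStack:
    def __init__(self):
        self.layers = []  # index 0 is the bottom tray

    def add_layer(self):
        self.layers.append({})

    def write(self, name, content):
        # Writes always land in the top-most layer, even when a lower
        # layer already holds a file with the same name.
        self.layers[-1][name] = content

    def delete(self, name):
        top = self.layers[-1]
        top.pop(name, None)  # drop any copy that lives in the top layer
        # Lower layers cannot be modified, so hide their copy instead.
        if name in self._view(self.layers[:-1]):
            top[name] = WHITEOUT

    def _view(self, layers):
        merged = {}
        for layer in layers:  # bottom to top, upper layers win
            for name, content in layer.items():
                if content is WHITEOUT:
                    merged.pop(name, None)
                else:
                    merged[name] = content
        return merged

    def logical_state(self):
        return self._view(self.layers)


# Replay the example; the file contents are placeholders.
fs = UnionStack()
fs.add_layer()                  # layer 0
fs.write("main.js", "v1")
fs.add_layer()                  # layer 1
fs.write("main.js", "v2")       # the edit lands in layer 1
fs.add_layer()                  # layer 2
fs.write("package.json", "{}")
fs.delete("main.js")            # recorded as a whiteout in layer 2
fs.add_layer()                  # layer 3
fs.write("server.js", "draft")
fs.write("server.js", "final")  # same layer, so edited in place

print(sorted(fs.logical_state()))  # ['package.json', 'server.js']
```

Note that layers 0 and 1 still hold their copies of main.js after the delete. The file disappears from the logical view only, which is why deleting a file in a later layer does not shrink the stack.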
As Docker goes through a Dockerfile, it builds the union mount file system that will be presented to the container. Each instruction in a Dockerfile corresponds to one layer in the stack. We can think of a build instruction as the instruction to arrange the new tray.
At the same time, Docker will also try to reuse a layer when possible. To be precise, if Docker has seen the next instruction applied with the same files (if applicable) to the current stack before, then it will reuse the tray that it generated last time.
For example, consider a project with two files: main.js and lib.js. The Dockerfile looks like this:
    FROM node:8
    WORKDIR /app
    COPY main.js .
    RUN curl https://www.example.com -o index.html
    COPY lib.js .
    RUN wc -l main.js
    CMD node main.js
Suppose that we build the image once, edit lib.js, then build the image again. In the second build, Docker will reuse the layers up to, and including, RUN curl, even though curl might produce a different output in the second run. All the subsequent instructions will be executed again. This includes RUN wc -l main.js: although it will output the same thing as in the previous build, Docker has not seen this instruction applied to this stack, since COPY lib.js creates a brand new layer.
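The reuse rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Docker's actual implementation: we assume a layer is identified by hashing its parent layer, the instruction text, and (for COPY) the content of the copied file. The digest and build helpers and the file contents are hypothetical.

```python
import hashlib


def digest(*parts):
    """Hash a sequence of strings into one hex id."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode())
        h.update(b"\0")  # separator so ("ab", "c") != ("a", "bc")
    return h.hexdigest()


def build(instructions, files, cache):
    """Replay a build and report whether each layer was reused.

    instructions: list of (text, copied_file_or_None) pairs
    files: maps file names to their current contents
    cache: set of layer ids from earlier builds, updated in place
    """
    parent = ""
    report = []
    for text, copied in instructions:
        content = files[copied] if copied else ""
        # A layer is "the same" only if the stack below it, the
        # instruction, and any copied file content all match.
        layer_id = digest(parent, text, content)
        report.append((text, "cached" if layer_id in cache else "executed"))
        cache.add(layer_id)
        parent = layer_id
    return report


DOCKERFILE = [
    ("FROM node:8", None),
    ("WORKDIR /app", None),
    ("COPY main.js .", "main.js"),
    ("RUN curl https://www.example.com -o index.html", None),
    ("COPY lib.js .", "lib.js"),
    ("RUN wc -l main.js", None),
    ("CMD node main.js", None),
]

# Placeholder file contents for the two project files.
files = {"main.js": "console.log('hi')\n", "lib.js": "// v1\n"}
cache = set()

first = build(DOCKERFILE, files, cache)
files["lib.js"] = "// v2\n"  # edit lib.js between builds
second = build(DOCKERFILE, files, cache)

for text, status in second:
    print(f"{status:9} {text}")
```

In the second build, the first four instructions come out as cached and the last three as executed. RUN curl is reused because the model, like the rule described above, never looks at what the command would produce. RUN wc -l main.js is executed again because its parent layer changed.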
This gives us a hint on how we can re-arrange our Dockerfile to speed things up.
We explored what a union mount file system is: a type of file system that Docker containers use. A union mount file system stores files in a persistent stack. The logical state of the file system is simply the union of the entire stack. In the second half of this series, we will explore the impact of using such a file system, and how we can make use of its characteristics to ensure that our Docker images can be built quickly.
So why does Docker go to such lengths to use a union mount file system, instead of whatever the host system is using? The short answer is that we want to save storage.
Suppose that we are building five different Docker images, each derived from the same base (e.g., imagine FROM ubuntu:18.04). An easy way of building these images is to copy the base five times, then make changes on top of each copy.
However, this is rather wasteful, as the content of these images is largely the same (the OS files) with only relatively small differences between them (the application-specific files).
With a union mount file system, Docker can reuse the common base and only record the application-specific changes that each image has. Essentially, a union mount file system allows us to perform structural sharing at the file system level.
Now, one might argue that if the host system is using a modern file system that supports copy-on-write, such as APFS or ZFS, we can achieve a similar saving without a union mount file system. This is partially true, but aside from on-disk storage, union mount file systems also save transfer bandwidth, as Docker only has to download the layers that it does not already have.
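Both savings can be illustrated with a rough Python sketch of the five-image scenario. It assumes each image is just a list of layer names and that the local store is a set of layers already on disk. The layer names and the layers_to_download helper are made up for this example; real layers are identified by content digests.

```python
def layers_to_download(image_layers, local_store):
    """Return the layers of an image that are not in the local store yet."""
    return [layer for layer in image_layers if layer not in local_store]


# Five hypothetical images that all sit on top of the same base layer.
BASE = ["ubuntu-base"]
images = {f"app{i}": BASE + [f"app{i}-files"] for i in range(1, 6)}

local_store = set()
downloads = {}
for name, layers in images.items():
    missing = layers_to_download(layers, local_store)
    downloads[name] = missing       # what a pull would have to transfer
    local_store.update(missing)     # shared layers are stored only once

copies_without_sharing = sum(len(layers) for layers in images.values())
print(len(local_store), "layers stored instead of", copies_without_sharing)
print(downloads["app1"])  # ['ubuntu-base', 'app1-files']
print(downloads["app2"])  # ['app2-files']
```

Only the first image pays for the base layer, both on disk and over the network. Every later image needs just its own application-specific layer.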