Containers Made Easy part 2

Handling Docker image format

In a previous post I briefly described what a container is. Now it's time to get our hands dirty and build some real stuff. In this part we will focus on the format used to pack images and on the Docker registry API for pulling content. At the end of this post you will have working code capable of downloading images (tested on just a few popular ones), unpacking all their layers and mounting them somewhere in the file system. It will be really basic support and a lot of stuff will be missing. Remember that we are not creating a new Docker engine here; we are just learning how it works by writing some code. If you want to build a full-featured Docker engine, just grab the packages from the Moby project (https://mobyproject.org/). You can find all the code presented in this post on my GitHub account. So let's start!

Image name format

In order to download and unpack an image we need to know its name. At the very beginning there was only one Docker image registry and its URL was hard-coded into the binary. There was no way to use a local or 3rd-party registry. Back then image naming was simple: account_name/image_name[:tag]. If I remember correctly, it was not possible to have an image in the registry without an account name (like just "ubuntu"). But people didn't want to push all their images into a public registry. Also, dotCloud (the company that became Docker) wanted to have official images without an account name in front. So the format evolved into something like: [registry.URL][:port]/[account_name/]image_name[:tag]. Why so many []? Most of the parts are optional. I'll describe all the elements below:

  • registry URL - the location of the image registry without the protocol (which defaults to https). Defaults to registry-1.docker.io
  • port - optional TCP port of the registry. Defaults to 443
  • account name - on the official registry this equals the user account name.
  • image name - the only required part. The actual name of the image.
  • tag - image tag, used to version images.

For example, a full image name may look like this: 127.0.0.1:5000/projectA/workerB:v1.0.0 or like this: docker.artifactory.my.corpo.net/teamA/imageB:v3.42. But a simple "busybox" is also a valid name. So let's try to parse such a string into a data structure with Registry, ImageName and Tag fields:

func ParseImageName(name string) (*Image, error) {
	// get a new instance of the image data structure
	img := new(Image)
	var err error
	// split the name on the first "/" only. If the result has just one element we are sure to have the default registry.
	repo := strings.SplitN(name, "/", 2)
	if len(repo) == 1 {
		// no custom registry, use the default one.
		// getDefaultRegistry() returns the URL of the default registry if one was configured, an error otherwise
		img.Registry, err = getDefaultRegistry()
		if err != nil {
			return nil, err
		}
		var imgName string
		// extract the tag. getTag accepts an image name with an optional tag
		// and returns the image name and the tag ("latest" if not present)
		imgName, img.Tag = getTag(repo[0])
		// check if we need to prepend "library/". This is needed when we use the official registry
		// and there was no account name: "ubuntu" is in fact "library/ubuntu:latest".
		img.ImageName = checkIfLibrary(imgName)
	} else {
		// we have two elements after the split.
		// Check for "." in repo[0]. If present we have a custom registry.
		if strings.Contains(repo[0], ".") {
			img.Registry = repo[0]
			img.ImageName, img.Tag = getTag(repo[1])
		} else {
			// no custom registry
			img.Registry, err = getDefaultRegistry()
			if err != nil {
				return nil, err
			}
			// keep the first part as part of the image name, as it was not a registry URL
			img.ImageName, img.Tag = getTag(repo[0] + "/" + repo[1])
		}
	}

	return img, nil
}
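To see these rules in action, here is a self-contained sketch of the same parsing logic. The `Image` struct and helper here are simplified stand-ins for the ones used above, and I also treat a ':' (port) in the first component as a registry marker, which the code above does not:

```go
package main

import (
	"fmt"
	"strings"
)

// Image is a simplified stand-in for the struct used in the post.
type Image struct {
	Registry, ImageName, Tag string
}

const defaultRegistry = "registry-1.docker.io"

// splitTag splits "name:tag" and defaults the tag to "latest".
func splitTag(s string) (name, tag string) {
	if i := strings.LastIndex(s, ":"); i >= 0 {
		return s[:i], s[i+1:]
	}
	return s, "latest"
}

// parse applies the naming rules described above.
func parse(name string) Image {
	img := Image{Registry: defaultRegistry}
	parts := strings.SplitN(name, "/", 2)
	switch {
	case len(parts) == 1:
		// bare name like "busybox" -> official "library/" namespace
		img.ImageName, img.Tag = splitTag(parts[0])
		img.ImageName = "library/" + img.ImageName
	case strings.Contains(parts[0], ".") || strings.Contains(parts[0], ":"):
		// first component looks like a host ("." or ":port") -> custom registry
		img.Registry = parts[0]
		img.ImageName, img.Tag = splitTag(parts[1])
	default:
		// first component is an account name, keep it in the image name
		img.ImageName, img.Tag = splitTag(name)
	}
	return img
}

func main() {
	for _, n := range []string{"busybox", "projectA/workerB:v1.0.0", "127.0.0.1:5000/projectA/workerB:v1.0.0"} {
		fmt.Printf("%+v\n", parse(n))
	}
}
```

Running it shows "busybox" expanding to library/busybox:latest on the default registry, while the 127.0.0.1:5000 name keeps its custom registry.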
With this done, we can move to the next part…

Downloading and parsing image manifest

An image manifest is a simple JSON file that contains information about all the layers we need to mount the image, plus image settings like the default command, entry point, etc. This time we will focus only on the layers. The data structures used to parse the JSON are in the docker-manifest.go file on GitHub. So let's download some manifests! The function below accepts a pointer to the image data structure produced by the name parser and returns a pointer to the parsed JSON. In the example code on GitHub it is wrapped by a function from the storage package that checks for a local file first, so we don't download the manifest every time we need it.

func GetManifest(img *Image) (*DockerManifest, error) {
	// the docker registry might require an auth token for pulling manifests.
	// If the token is not set yet, try to get it. More on this later.
	if img.Token == "" {
		if err := img.getAuthToken(); err != nil {
			return nil, err
		}
	}
	client := &http.Client{}

	// the manifest URL looks like "https://registryURL/v2/image/manifests/tag"
	req, err := http.NewRequest("GET", protocol+"://"+img.Registry+"/v2/"+img.ImageName+"/manifests/"+img.Tag, nil)
	if err != nil {
		return nil, err
	}

	// inform the registry that we want JSON in v2 format (we can't parse anything else)
	req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v2+json")

	// add the token. We do this even if a token is not needed; the registry will just ignore it.
	// We also support only anonymous downloads.
	req.Header.Add("Authorization", "Bearer "+img.Token)
	// make the actual request
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// parse it if there was no error
	var manifest DockerManifest
	if err := json.NewDecoder(resp.Body).Decode(&manifest); err != nil {
		return nil, err
	}
	return &manifest, nil
}
Now I need to explain all this token stuff. The official Docker registry (and some 3rd-party registry servers) requires a token to be attached to each request, even if the image is available anonymously. The function below checks whether a token is needed and obtains one.

func (i *Image) getAuthToken() error {
	// first check the response to GET /v2/. If we get 401 Unauthorized we need to obtain a token.
	resp, err := http.Get(protocol + "://" + i.Registry + "/v2/")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusUnauthorized {
		// no token needed
		return nil
	}
	// we need a token for this repo. The WWW-Authenticate header tells us where to get it:
	// it contains the auth server URL (realm) and the service name we want to use.
	re := regexp.MustCompile(`Bearer realm="(?P<realm>.*)",service="(?P<service>.*)"`)
	parsed := re.FindStringSubmatch(resp.Header.Get("WWW-Authenticate"))
	if len(parsed) != 3 {
		return errors.New("unsupported WWW-Authenticate header: " + resp.Header.Get("WWW-Authenticate"))
	}
	realm, service := parsed[1], parsed[2]
	// once we have the realm and service we can ask for an anonymous token.
	// We need to specify a scope here, as the token is valid only for one image and operation type.
	resp, err = http.Get(realm + "?service=" + service + "&scope=repository:" + i.ImageName + ":pull")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return errors.New("auth status code: " + resp.Status)
	}
	var authResponse interface{}
	if err := json.NewDecoder(resp.Body).Decode(&authResponse); err != nil {
		return err
	}
	// set the token on the image struct
	i.Token = authResponse.(map[string]interface{})["token"].(string)

	return nil
}

The line just above the return might look a bit cryptic to newcomers, but it's quite simple. We decoded the JSON response into an interface{}, but I know that the encoding/json package put a map[string]interface{} there. I also know that I want the "token" field from the JSON and that it is a string. So I type-asserted the first interface{} to map[string]interface{} and then the interface{} from the map to string. I could avoid all this by creating a proper data structure for the response JSON and parsing directly into that struct, but I'm too lazy ;) The drawback: it will crash when the asserted types don't match, but as all this is just for a blog it's fine. Remember, all type assertions presented in this code were made by professionals; don't try this at home. Now that we know what layers go into our image, we can start downloading them…

Downloading and unpacking layers

This part is a very important one and will let you understand in detail how Docker images work. I'll skip the download part here; you can check it on GitHub. I'll focus on the unpacking, as this is the crucial bit.

When thinking about layers in the Docker context, visualize a stack of glass plates. You write something on each of them and then stack them together, one on top of another. When you look through them you see the sum of all the layers below. In Docker this is implemented as a set of directories: each directory represents one layer, and each one is packed into a tar archive for transfer. Adding new files and folders to an image seems quite obvious: just add the new file in the layer above the previous one and you are done. That is true, but did you ever wonder how Docker knows that you deleted a file or folder at some point during the image build? You can't remove it from a bottom layer, as those are immutable and addressable by checksum. So what do you do?

File systems that support layering have the concept of whiteouts and opaque directories. These are special markers that tell the file system what to show and what to hide in the final mounted fs. At the beginning Docker supported only one layered fs: AUFS. This is why all tar files with image layer diffs have AUFS marks in them. The idea is very simple: to mark a file or directory as deleted, a file with a special name is created. If a file name starts with the .wh. prefix (for example .wh.test.txt), then the file with that name (test.txt) from the layers below won't be visible in the merged view. Similarly, if you put a file named .wh..wh..opq into a directory, then the contents of that directory from the layers below won't be visible.

In the first post I decided to use overlay as our layered file system, so during unpacking we need to convert all AUFS marks into overlay fs ones. The Docker engine does this too: it transforms whiteouts into the format supported by the configured storage driver. The opposite happens during a push: before packing into tar, all marks are converted back into the AUFS format.

The whiteout marks for overlay fs are:

  • a character device with 0/0 device number for deleted files
  • an xattr "trusted.overlay.opaque" set to "y" for opaque directories

One important thing to note here: files in layers are not diffs. The content of a file is taken from the topmost layer that contains it. This is important to remember when you build your own images. For example, if you have a zip file in a lower layer and want to add one file to it, the zip will in fact be copied up from the lower layer to the upper one and only then will the file be added. It will be slower and consume disk space twice. Any change to a file triggers a copy-up, even a change to a file attribute like permissions. So, for example, if your Dockerfile contains:

FROM foo
ADD file.bar /file.bar
RUN chmod +x /file.bar

you will end up with two copies of file.bar on disk, just in different layers. This is also why it's important to remove temporary files in the same layer they were created in (the same Dockerfile step); otherwise it makes no sense, as the actual space usage stays the same. With all of the above in mind, let's move to the code. Our function accepts the image struct, a layer digest, and a compression string. It downloads the layer and unpacks it to disk. For now the only supported layer compression format is tar.gz.

func downloadLayer(img *registry.Image, digest string, compression string) error {
	// download blob from registry, it will be stream of data we need to handle here
	blob, err := registry.GetBlob(img, digest)
	if err != nil {
		return err
	}
	// start unpacking
	gz, err := gzip.NewReader(blob)
	if err != nil {
		return err
	}
	tr := tar.NewReader(gz)
	log.Printf("Downloading and unpacking layer: %s\n", digest)
	// create directory for layer
	if err := os.MkdirAll(filepath.Join(storageRootPath, "blobs", digest), 0755); err != nil {
		return err
	}
	// handle each file from layer
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			// end of tar archive
			break
		}
		if err != nil {
			return err
		}
		// set destination path for the file
		dst := filepath.Join(storageRootPath, "blobs", digest, hdr.Name)

		// here we check type of each tar element and handle it in proper way
		switch hdr.Typeflag {
		case tar.TypeDir:
			if _, err := os.Stat(dst); err != nil {
				if err := os.MkdirAll(dst, 0755); err != nil {
					return err
				}
			}
		case tar.TypeReg:
			/* regular file, can be possibly AUFS deletion (whiteout) mark
			docker saves info about deleted files in AUFS format so we need to convert it into overlay ones while unpacking
			AUFS format is file based that is why we check that only on regular file type in checkIfDeleted() function
      		you can check it in more details on GitHub */
			writeFile, err := checkIfDeleted(hdr, dst)
			if err != nil {
				return err
			}
			if !writeFile {
				continue
			}
			// not a deletion mark, just create the file
			f, err := os.OpenFile(dst, os.O_CREATE|os.O_RDWR, os.FileMode(hdr.Mode))
			if err != nil {
				return err
			}
			if _, err := io.Copy(f, tr); err != nil {
				f.Close()
				return err
			}
			// close explicitly; a defer inside this loop would pile up until the function returns
			f.Close()
		case tar.TypeSymlink:
			err := os.Symlink(hdr.Linkname, dst)
			if err != nil {
				return err
			}
		case tar.TypeLink:
			// very naive security check to not have hard links that point outside of container
			if !strings.HasPrefix(dst, storageRootPath+"/blobs") {
				return fmt.Errorf("invalid hardlink %q -> %q", dst, hdr.Linkname)
			}
			err := os.Link(filepath.Join(storageRootPath, "blobs", digest, hdr.Linkname), dst)
			if err != nil {
				return err
			}
		default:
			// we don't handle this type of entry, so the final mounted image may be broken,
			// but I haven't found an image where it matters. Safe to ignore; feel free to
			// add more types if you need them (character devices, etc.)
			fmt.Printf("Unsupported file type found; name: %s\tmode: %v\tdigest: %s\ttarget: %s\n", hdr.Name, hdr.Typeflag, digest, dst)
		}
	}
	return nil
}
In a nutshell, we just iterate over the tar archive here and handle each file while unpacking it into this pre-created folder structure:

-storageRootPath
|-manifests       <- jsons with name as base64 string from: registry URI + image name + tag
|-blobs           <- image layers
|-containers      <- containers will have their fs here
||-<container_name>
|||-rootfs        <- mounted overlayfs
|||-workdir       <- working layer used internally by overlay
|||-upper         <- top layer that will hold all changes to image

We now have the image manifest and properly unpacked layers on our disk. The last step in this post is to…

Mounting image rootfs in our system

This is in fact the simplest thing in the whole post. We just prepare the mount parameters and call the mount syscall. If you want more details, documentation about the overlay file system can be found, for example, here. To learn more about the mount syscall, just type man 2 mount in a terminal on your Linux machine. Let's start with the parameters. Our function accepts a pointer to the manifest struct and the target directory that will hold the mounted image. As the layers in the manifest are in reverse order (the top one is last on the list), we need to iterate over the array backwards to get the proper layer order.

func prepareOverlayMountOptions(manifest *registry.DockerManifest, target string) string {
	var (
		digests []string
		lowers  string
	)
	for i := len(manifest.Layers) - 1; i >= 0; i-- {
		digests = append(digests, strings.Replace(filepath.Join(storageRootPath, "blobs", manifest.Layers[i].Digest), ":", "\\:", 1))
	}
	lowers = strings.Join(digests, ":")
	return "lowerdir=" + lowers + ",upperdir=" + filepath.Join(target, "upper") + ",workdir=" + filepath.Join(target, "workdir")
}
The option string is quite simple and goes like this: lowerdir=/lower1:/lower2:/lower3,upperdir=/upper,workdir=/work where:

  • lowerdir - paths to our layers, separated with ':'. Number 1 is the topmost one.
  • upperdir - the directory that will hold all changes made to the image files while the overlay is mounted.
  • workdir - an internal technical directory for overlay
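Stripped of the manifest plumbing, the option-string assembly can be sketched like this. The digests and paths are made up; note the escaped ':' inside each lowerdir path, since ':' is the layer separator in mount options:

```go
package main

import (
	"fmt"
	"strings"
)

// buildOverlayOpts joins layer paths (topmost layer first) into overlay mount options.
func buildOverlayOpts(layerDirs []string, upper, work string) string {
	escaped := make([]string, len(layerDirs))
	for i, d := range layerDirs {
		// ':' separates lowerdirs, so the ':' inside "sha256:..." must be escaped
		escaped[i] = strings.Replace(d, ":", `\:`, 1)
	}
	return "lowerdir=" + strings.Join(escaped, ":") + ",upperdir=" + upper + ",workdir=" + work
}

func main() {
	layers := []string{ // topmost layer first, as overlay expects
		"/tmp/cme/blobs/sha256:f3a94ccb",
		"/tmp/cme/blobs/sha256:219d2e45",
	}
	fmt.Println(buildOverlayOpts(layers, "/tmp/cme/containers/test_cnt/upper", "/tmp/cme/containers/test_cnt/workdir"))
}
```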

Once we have it, we can call the mount syscall:

func mountImageOverlay(manifest *registry.DockerManifest, target string) error {
	err := createMountTargetDirs(target)
	if err != nil {
		return err
	}
	// get layer locations converted into mount options
	mountOptions := prepareOverlayMountOptions(manifest, target)

	// we could do os.exec here and use the OS mount command, but Go has syscall support, so let's use it
	err = syscall.Mount("overlay", filepath.Join(target, "rootfs"), "overlay", 0, mountOptions)
	if err != nil {
		return err
	}
	return nil
}
As I said, nothing super fancy in this part: a plain, simple mount call that presents all the layers we downloaded as a single mount point. In the next posts we will use it as the starting point for running containerized processes. Now it's time for a…

Sum up

In this post I've explained in quite some detail how to download and unpack a Docker image. I've also shown how to merge it all into a single mount point. All code snippets come from the full example app that you can find on GitHub; start your analysis from tmpcli/tmpcli.go. It will be the base for my next post in the series. It needs to be run as root to have overlay mount privileges. Here is how it looks when started:

root@host# ./tmpcli -n test_cnt -i postgres:10
2017/09/26 14:41:39 Downloading and unpacking layer: sha256:219d2e45b4afc3d80375a2fcf76505684de01f55027fb35a691099f0e538fdd8
2017/09/26 14:41:59 Downloading and unpacking layer: sha256:87b4d6274d7716c4dbf67b92c421750ef3a6513e385dd48a47146219c7a3d77e
2017/09/26 14:42:02 Downloading and unpacking layer: sha256:2569a32ee6dd4d651bd02f3ea71f60d9bc6969c70c26d6306a7b72cbb1870393
2017/09/26 14:42:03 Downloading and unpacking layer: sha256:23b4d0fc31922c229cb33c7c7e01ca0f16ece5822121f709f3a2b61149c07e68
2017/09/26 14:42:03 Downloading and unpacking layer: sha256:8275aae461c7d8db771c1c9b3bcbf3a2956555e62e552f4f5d725c56593aac7f
2017/09/26 14:42:06 Downloading and unpacking layer: sha256:45087ee6fc31c023b9208ea548b0490c254cf452728d877cbb1b1cc9906858f0
2017/09/26 14:42:07 Downloading and unpacking layer: sha256:2d2265f720a6572f001f677339179495c0cd3c90403e22bdfdeed22d8b81f5d9
2017/09/26 14:42:07 Downloading and unpacking layer: sha256:48edada1b7d52064240f510fc7580c02cd355c0d7390454693677bfcdc73118e
2017/09/26 14:42:34 Downloading and unpacking layer: sha256:854fe48abde873832b75f5b7c0d13a3585fbe3c189a4cb66ebdd9d25bb3665a2
2017/09/26 14:42:34 Downloading and unpacking layer: sha256:db74cc6e6ab4315e2233c9482ca33e5ffa46ed826eb5ed8f9553c67c7645c054
2017/09/26 14:42:35 Downloading and unpacking layer: sha256:f9a283997561a1e97c8bd89be09ddcee9a9b2f5f02ee171cd25caec5f5fca8e6
2017/09/26 14:42:35 Downloading and unpacking layer: sha256:f7ebe3ec6405cfa9d0fe72d384a0dde22b249dc906bcdd3d3292599b8d72680b
2017/09/26 14:42:35 Downloading and unpacking layer: sha256:f3a94ccb293fbf3035f8af2fcc55ed0b777a9650cfe2be70c7e82ec33a761b58
2017/09/26 14:42:35 Root path:  /tmp/cme/containers/test_cnt/rootfs
root@host# chroot /tmp/cme/containers/test_cnt/rootfs/
root@inchroot# /usr/bin/psql --version
...
psql (PostgreSQL) 10rc1

We can also see how upperdir records all changes made to the "container" file system:

root@host# ls -l /tmp/cme/containers/test_cnt/upper/root/
-rw------- 1 root root 49 09-26 14:47 /tmp/cme/containers/test_cnt/upper/root/.bash_history

Each change will be stored there, so this is the place from which you would create the next layer of your image. Want to build a custom Docker image builder? Just mount a base image and run some commands on it. The next layer will be in upperdir: pack it and repeat the whole process if needed. It could be a nice exercise. I hope it wasn't boring and you made it to the end, and that you now know a lot more about Docker image internals.