Docker Images for Data Science — Layering to Minimize Library Rebuilds and Cleaning Data Sets

Dale Smith, Ph.D.
Published in The Startup · 3 min read · Apr 1, 2020


Proper Dockerfile design and careful layering of your Docker images speed up cleaning data sets and rebuilding your development environment.

Express library build dependencies and data cleaning workflows in Dockerfiles. Image by David Eppstein, Creative Commons CC0 1.0 Universal Public Domain Dedication

The most irritating part of any Data Science project is creating a reproducible data cleaning and engineering pipeline. It’s not productive to spend time cleaning data that has already been cleaned: early steps have to be implemented and tested properly so you can focus on the downstream steps, and re-cleaning the entire data set from start to finish just to modify or improve a downstream step is wasted effort.

Setting up your development environment is similar. It’s not productive to spend time troubleshooting CMake and configure problems, especially when you have to cross-compile or build for IoT devices. A problem with an upstream dependency can cascade, causing a lengthy rebuild.

Dockerfiles help streamline the rebuild workflow and allow you to think in terms of dependencies. Breaking up your data cleaning and code build workflows into steps with separate Dockerfiles allows Docker to decide when an image needs to be rebuilt due to a change in the upstream dependencies. Docker does not rebuild a layer in an image unless it detects that the relevant lines in the Dockerfile or included artifacts have changed. Early steps do not have to be repeated every time you need to execute the workflow due to a downstream workflow change.
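
To make the caching behaviour concrete, here is a minimal sketch of a data-cleaning Dockerfile; the base image, script names, and data paths are illustrative, not taken from our project. Editing a later step only invalidates the cache from that step’s COPY onward, so the earlier cleaning steps are not re-run.

```dockerfile
# Illustrative data-cleaning pipeline: each step is its own cached layer.
FROM python:3.8-slim

WORKDIR /pipeline

# Dependencies change rarely, so install them in an early layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The raw data set also changes rarely.
COPY data/raw.csv /data/raw.csv

# Each cleaning step gets its own COPY/RUN pair. Editing step_3 only
# invalidates the cache from that COPY onward; steps 1 and 2 are reused.
COPY step_1_drop_duplicates.py .
RUN python step_1_drop_duplicates.py /data/raw.csv /data/step_1.csv

COPY step_2_impute_missing.py .
RUN python step_2_impute_missing.py /data/step_1.csv /data/step_2.csv

COPY step_3_engineer_features.py .
RUN python step_3_engineer_features.py /data/step_2.csv /data/clean.csv
```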

At the urging of a member of our development team, we decided to build libraries in separate Docker images. The dependencies and build order are specified using the Docker COPY command in downstream Dockerfiles.

Beginning with a Dockerfile that depends on the base Ubuntu 18.04 and CentOS 8.1.1911 images from Docker Hub, we first install all the build tools the team needs. This base image is used by all subsequent images; downstream Dockerfiles include their upstream dependencies.
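
As a rough sketch (the exact package list and image tag are assumptions, not from our project), the Ubuntu base image looks something like this; the CentOS 8.1.1911 variant is analogous, using dnf instead of apt-get.

```dockerfile
# Illustrative base image: the team's build tools, installed once.
FROM ubuntu:18.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        git \
        pkg-config \
        python3-dev \
        wget \
    && rm -rf /var/lib/apt/lists/*
```

Tag it once (for example, `docker build -t myteam/build-base:18.04 .`) and every downstream Dockerfile can start with `FROM myteam/build-base:18.04`.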

Import the layers you require for the final Docker image.
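
One way to express this (image names and install prefixes here are illustrative) is with `COPY --from=<image>`, which Docker accepts for arbitrary images, not just stages of the same build:

```dockerfile
# Illustrative final image: import prebuilt libraries from upstream
# library images instead of rebuilding them here.
FROM myteam/build-base:18.04

COPY --from=myteam/openblas:0.3    /opt/openblas   /opt/openblas
COPY --from=myteam/opencv:4.1      /opt/opencv     /opt/opencv
COPY --from=myteam/tensorflow:1.15 /opt/tensorflow /opt/tensorflow

ENV LD_LIBRARY_PATH=/opt/openblas/lib:/opt/opencv/lib:/opt/tensorflow/lib
```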

Our downstream libraries include OpenBLAS, FFmpeg, OpenCV, TensorFlow, MXNet, TensorFlow Lite (for ARMv7-based devices), and other AI-related libraries, as well as the libraries required for these builds.

We use several bash scripts to rebuild the individual upstream images in the proper order. The next enhancement we plan is to handle the dependencies via makefiles.
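
A rebuild script along these lines (the directory layout and image tags are hypothetical) keeps the order explicit until the makefile-based dependency handling is in place:

```bash
#!/usr/bin/env bash
# Rebuild the upstream images in dependency order so downstream
# COPY --from lines always pick up current artifacts.
set -euo pipefail

docker build -t myteam/build-base:18.04 base/
docker build -t myteam/openblas:0.3     openblas/
docker build -t myteam/opencv:4.1       opencv/
docker build -t myteam/tensorflow:1.15  tensorflow/
docker build -t myteam/dev:latest       dev/
```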

The end result is a Docker image that a developer can pull from Docker Hub. This developer image is suitable for use with any IDE; I’m using Visual Studio for Linux as well as Cloud9. A shell script is used to automate:

  • Starting a container from the image;
  • Importing the user’s .ssh, .bashrc, and .github configuration files;
  • Mounting the source tree from their local drive.

My development environment, ready for machine learning.

Pass the source code tree to the docker run command.
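
A launcher script in this spirit (the image name, in-container paths, and the use of .gitconfig are assumptions, not the article’s exact script) handles the three chores above and passes the source tree as a bind mount:

```bash
#!/usr/bin/env bash
# Start a dev container, mount the user's config files read-only,
# and bind-mount the local source tree as the working directory.
set -euo pipefail

SRC_TREE=${1:?usage: dev.sh /path/to/source}

docker run -it --rm \
    -v "$HOME/.ssh":/home/dev/.ssh:ro \
    -v "$HOME/.bashrc":/home/dev/.bashrc:ro \
    -v "$HOME/.gitconfig":/home/dev/.gitconfig:ro \
    -v "$SRC_TREE":/workspace \
    -w /workspace \
    myteam/dev:latest \
    bash
```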


Co-Founder and Chief Research Officer — Vallum Software. My interests are in C/C++, machine learning, Python, Pandas, and Jupyter Notebooks.