Publishable Stuff

Rasmus Bååth's Blog


A Makefile recipe for Python data pipelines

2023-06-10

If you’ve ever looked at a Makefile in a Python or R repository, chances are it contained a collection of useful shell commands (make test -> runs all the unit tests, make lint -> runs automatic formatting and linting, etc.). That’s a perfectly good use of make, and if that’s what you’re after then here’s a good guide for how to set that up. However, the original reason make was created was to run shell commands that might depend on other commands having been run first, in the right order. In 1976, when Stuart Feldman created make, those shell commands were compiling C programs, but nothing is stopping you from using make to set up simple data pipelines instead. And there are a couple of good reasons to use make for this purpose: it only reruns the steps whose inputs have changed, it figures out the right order to run them in, and the Makefile itself documents what each step depends on.

Here I’ll give you a handy template and some tips for building a data pipeline using python and make. But first, let’s look at an example data pipeline without make.

A Python data pipeline, but without make

As an example, let’s take a minimal, somewhat contrived, data pipeline resulting in an analysis of Pokemon data. This data pipeline could easily fit into a single notebook, but here it’s a stand-in for a more complex pipeline. In this data pipeline, we’ll

  1. Download and clean up a file with Pokemon statistics.
  2. Download files with English and German Pokemon names, and join them together. (Why? I guess we want to cater to the substantial German audience of Pokemon analytics…)
  3. Render a Jupyter notebook analysis of these data as an HTML document.

You can find the full data pipeline code here without output, here with all output, and here’s the final Pokemon “analysis”. But the important part is not the code. What’s important is the problem we now have: as we’ve put each step of this analysis into its own script, we have several scripts that need to run, and in the right order. One way to do that is to stick them all into a shell script, say, run-analysis.sh:

#!/bin/bash
# Downloading and cleaning all the data
python pokemon-stats/01_download.py
python pokemon-stats/02_clean.py
python pokemon-names/01_download-en.py
python pokemon-names/02_download-de.py
python pokemon-names/03_combine.py
# Rendering the final report to html
jupyter nbconvert --execute quick-pokemon-analysis.ipynb --to html

For our minimal example, this will work quite well. But if our data pipeline were to grow, we would run into three problems:

  1. Even if we just change something at the end of the pipeline (say quick-pokemon-analysis.ipynb) we’ll have to run the whole pipeline every time.
  2. We have to manually make sure that the Python scripts are listed in the right order. That may be easy when there are six scripts, but it quickly becomes unmanageable.
  3. The shell script doesn’t make clear what the inputs and outputs of each Python script are.

A Python data pipeline, now with make

Switching out the shell script for a Makefile solves these three problems, as the Makefile encodes the whole dependency graph of the pipeline: that is, which files each output in the pipeline depends on. For our simple example, the dependency graph looks like this (sketched in text, with arrows pointing from each output to its dependencies):

quick-pokemon-analysis.html
 ├─> quick-pokemon-analysis.ipynb
 ├─> pokemon-stats/pokemon-stats.csv
 │    ├─> pokemon-stats/02_clean.py
 │    └─> pokemon-stats/pokemon-stats-raw.csv
 │         └─> pokemon-stats/01_download.py
 └─> pokemon-names/pokemon-names.csv
      ├─> pokemon-names/03_combine.py
      ├─> pokemon-names/pokemon-en-names.json
      │    └─> pokemon-names/01_download-en.py
      └─> pokemon-names/pokemon-de-names.json
           └─> pokemon-names/02_download-de.py

Arrows here point to the dependencies, so, for example, if either 02_clean.py or pokemon-stats-raw.csv changes, we would need to regenerate pokemon-stats.csv. Using this dependency graph, make will figure out in what order things need to run and what needs to be re-run (there’s no need to re-run pokemon-stats/01_download.py if we’ve only made changes to quick-pokemon-analysis.ipynb, say).
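Under the hood, make decides whether to regenerate a file by comparing modification times: a target is rebuilt if it’s missing, or if any of its dependencies is newer than it. Here’s a minimal Python sketch of that rule (simplified; real make also handles phony targets, pattern rules, and much more):

```python
import os

def needs_rebuild(target, dependencies):
    """make's core staleness rule: rebuild if the target is missing
    or if any dependency has a newer modification time."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

# For example, pokemon-stats.csv would be regenerated if either its
# script or its raw input has changed since it was last written:
# needs_rebuild("pokemon-stats/pokemon-stats.csv",
#               ["pokemon-stats/02_clean.py",
#                "pokemon-stats/pokemon-stats-raw.csv"])
```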

The Makefile syntax for defining dependency graphs is a little bit unusual, but a Makefile rule looks like this:

the_output_file: the_first_dependency the_second_dependency the_third_dependency etc.
	shell_command_generating_the_output_file

For pokemon-stats.csv in our pipeline the Makefile rule would look like this:

pokemon-stats/pokemon-stats.csv: pokemon-stats/02_clean.py pokemon-stats/pokemon-stats-raw.csv
	python pokemon-stats/02_clean.py

Big gotcha: The indentation before the shell command needs to be there, and it needs to be a tab. Not two spaces, not four spaces, an actual tab character. Accidentally putting spaces there instead of a tab is a very common problem, and make will refuse to run with an error like *** missing separator. Stop.

We’re now ready to write up the whole dependency graph as a Makefile, but there are a couple of tricks we can use to make the rules prettier and less verbose:

  1. Define a short variable for the folder path (p := pokemon-stats) so it doesn’t have to be repeated in every file name.
  2. Split long dependency lists over several lines with a trailing backslash (\).
  3. Use the “automatic” variable $<, which make replaces with the rule’s first dependency, instead of repeating the script name in the recipe.

With this, the rule would become:

p := pokemon-stats
$p/pokemon-stats.csv: $p/02_clean.py \
  $p/pokemon-stats-raw.csv
	python $<

(Note that the second dependency, $p/pokemon-stats-raw.csv, can be indented with leading spaces, while python $< still needs a leading tab.)

The full dependency graph as a Makefile (traditionally also given the file name Makefile) would then be:

quick-pokemon-analysis.html: quick-pokemon-analysis.ipynb \
  pokemon-stats/pokemon-stats.csv \
  pokemon-names/pokemon-names.csv 
	jupyter nbconvert --execute $< --to html

p := pokemon-stats
$p/pokemon-stats-raw.csv: $p/01_download.py
	python $<

$p/pokemon-stats.csv: $p/02_clean.py \
  $p/pokemon-stats-raw.csv
	python $<

p := pokemon-names
$p/pokemon-en-names.json: $p/01_download-en.py
	python $<

$p/pokemon-de-names.json: $p/02_download-de.py
	python $<

$p/pokemon-names.csv: $p/03_combine.py \
  $p/pokemon-en-names.json \
  $p/pokemon-de-names.json
	python $<

So, slightly longer than the run-analysis.sh shell script we started out with, but now all the dependencies are explicit. Note that the order of the rules doesn’t matter, except that make, by default, builds the first target it encounters. Here that’s quick-pokemon-analysis.html, which makes sense, as it’s the ultimate output of the pipeline. (You can also build a specific output by naming it, e.g. make pokemon-names/pokemon-names.csv.) To run the pipeline, open up a terminal, navigate to the folder of the Makefile, and run make, which will give you the following output:

> make
python pokemon-stats/01_download.py
python pokemon-stats/02_clean.py
python pokemon-names/01_download-en.py
python pokemon-names/02_download-de.py
python pokemon-names/03_combine.py
jupyter nbconvert --execute quick-pokemon-analysis.ipynb --to html
[NbConvertApp] Converting notebook quick-pokemon-analysis.ipynb to html
[NbConvertApp] Writing 609122 bytes to quick-pokemon-analysis.html

From this, we can see that make figured out the correct order to run things. Now for the exciting part: what happens if we run make again?

> make
make: `quick-pokemon-analysis.html' is up to date.

Nothing happens! make has figured out that, because we haven’t changed any of the dependencies since last time, nothing needs to be rerun. What if we make an update to, say, pokemon-names/02_download-de.py and run make again?

> make
python pokemon-names/02_download-de.py
python pokemon-names/03_combine.py
jupyter nbconvert --execute quick-pokemon-analysis.ipynb --to html
[NbConvertApp] Converting notebook quick-pokemon-analysis.ipynb to html
[NbConvertApp] Writing 609122 bytes to quick-pokemon-analysis.html

Then make only reruns what’s needed, according to the dependency graph, and nothing more. That’s the beauty of make!
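This selective rerunning can be sketched in a few lines of Python (a simplified model with a hypothetical helper name, not how make is actually implemented): given the pipeline’s dependency graph, collect every output downstream of the changed file.

```python
# The pipeline's dependency graph: each output mapped to the files it depends on.
graph = {
    "pokemon-stats/pokemon-stats-raw.csv": ["pokemon-stats/01_download.py"],
    "pokemon-stats/pokemon-stats.csv": ["pokemon-stats/02_clean.py",
                                        "pokemon-stats/pokemon-stats-raw.csv"],
    "pokemon-names/pokemon-en-names.json": ["pokemon-names/01_download-en.py"],
    "pokemon-names/pokemon-de-names.json": ["pokemon-names/02_download-de.py"],
    "pokemon-names/pokemon-names.csv": ["pokemon-names/03_combine.py",
                                        "pokemon-names/pokemon-en-names.json",
                                        "pokemon-names/pokemon-de-names.json"],
    "quick-pokemon-analysis.html": ["quick-pokemon-analysis.ipynb",
                                    "pokemon-stats/pokemon-stats.csv",
                                    "pokemon-names/pokemon-names.csv"],
}

def targets_to_rerun(changed_file, graph):
    """Return every output that is (directly or indirectly) downstream
    of changed_file and therefore needs to be regenerated."""
    stale = set()
    def is_stale(target):
        if target in stale:
            return True
        if any(dep == changed_file or is_stale(dep)
               for dep in graph.get(target, [])):
            stale.add(target)
            return True
        return False
    for target in graph:
        is_stale(target)
    return stale

# Changing the German download script triggers exactly the three steps
# that make reran above:
print(sorted(targets_to_rerun("pokemon-names/02_download-de.py", graph)))
# -> ['pokemon-names/pokemon-de-names.json',
#     'pokemon-names/pokemon-names.csv',
#     'quick-pokemon-analysis.html']
```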

A Python Makefile template

make is an old and deep piece of software; there are many things to learn and many ways things can go wrong. But here’s a Makefile template that gets you started with using make for Python data pipelines.

# Some reasonable makefile settings
# Always use the bash shell instead of the default shell which might be zsh, fish, etc.
SHELL := bash
# Flags to put bash into "strict" mode
.SHELLFLAGS := -eu -o pipefail -c
MAKEFLAGS += --warn-undefined-variables
# We're not compiling C++, so no need for the many built-in rules 
MAKEFLAGS += --no-builtin-rules

p := some/path
$p/an-output.csv: $p/a-script.py \
  $p/a-dependency.csv \
  $p/another-dependency.csv
	python $<
    
$p/another-output.csv: $p/another-script.py \
  $p/an-output.csv
	python $<

# ---------------------------------
# An explanation of a recurring pattern in this Makefile. To avoid repeating the same 
# path multiple times, for scripts that live in folders, for example:
# ---------------------------------
# some/path/an-output.csv: some/path/a-script.py \
#   some/path/a-dependency.csv \
#   some/path/another-dependency.csv
#		python some/path/a-script.py
# ---------------------------------
# We write the above like this instead:
# ---------------------------------
# p := some/path
# $p/an-output.csv: $p/a-script.py \
#   $p/a-dependency.csv \
#   $p/another-dependency.csv
#		python $<
# ---------------------------------
# Some gotchas here:
# * The variable p (as in Path) can be used as $p, but this only works for
#   one-character variables; for longer variables parentheses are $(needed).
# * As p might be redefined throughout the Makefile, the recipe *can't* be
#   python $p/a-script.py. Recipes are expanded only after the whole Makefile
#   has been parsed, so every $p in a recipe would be set to the last value
#   assigned to p. To avoid this we use the "automatic" variable $<, which
#   is replaced by the first prerequisite. Therefore, the first prerequisite
#   of every rule needs to be the Python script, otherwise $< will point to
#   a CSV file or similar.
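That second gotcha can be illustrated with a small Python model of make’s two-phase evaluation (a hypothetical sketch, not make’s actual internals): target and dependency lines are expanded immediately while the Makefile is parsed, but recipe lines are only expanded later, when the rule runs, so by then p holds whatever value it was assigned last.

```python
variables = {}
rules = []  # (target, dependencies, recipe_template)

def parse_rule(target, deps, recipe):
    # Parse phase: targets and dependencies are expanded right away,
    # using the *current* value of p...
    expand = lambda s: s.replace("$p", variables["p"])
    rules.append((expand(target), [expand(d) for d in deps], recipe))
    # ...but the recipe is stored unexpanded.

variables["p"] = "pokemon-stats"
parse_rule("$p/pokemon-stats.csv", ["$p/02_clean.py"],
           "python $p/02_clean.py")  # the tempting-but-wrong recipe
variables["p"] = "pokemon-names"     # p is later redefined

# Execution phase: the recipe is expanded only now, with the *final* p.
target, deps, recipe = rules[0]
print(target)                                # pokemon-stats/pokemon-stats.csv
print(recipe.replace("$p", variables["p"]))  # python pokemon-names/02_clean.py  <- wrong folder!
```

Writing the recipe as python $< sidesteps the problem, since $< is resolved per rule from the already-expanded dependency list.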

Some last tips/warnings when using make for setting up data pipelines: always keep the script as the first dependency of a rule, so that $< points to it; remember that recipe lines must be indented with a tab; and note that make decides what to rerun by comparing file modification times, so merely touching a dependency is enough to trigger a rerun of everything downstream of it.

Posted by Rasmus Bååth | 2023-06-10 | Tags: Python