Edward Tufte recommends several principles for representing data in his book “The Visual Display of Quantitative Information”. Here are some that I find especially useful:
- The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented (p. 77)
- Write out explanations of the data on the graphic itself. Label important events in the data (p. 77)
- Show data variation, not design variation (p. 77)
- Maximize the data-ink ratio. Erase non-data ink and redundant data-ink (p. 105)
Python has an array of visualization packages (here’s a good overview). My go-to package until now has been plotnine, which provides a ggplot2 interface in Python. It’s elegant and has allowed me to escape learning Python’s infamously annoying package, matplotlib. However, for all the flak it gets, matplotlib is the most customizable and powerful visualization package Python offers. It also serves as the base for most other Python visualization tools (including plotnine). So, here’s a guide to implementing Tufte’s principles in Python using matplotlib.
A very short introduction to Matplotlib
A confusing aspect of matplotlib is the existence of two APIs. This is possibly one of its biggest flaws. It makes troubleshooting very difficult, since answers on StackOverflow frequently jump between the two APIs.
- MATLAB-style API: Matplotlib was originally written to mimic MATLAB, and the pyplot (plt) interface provides a collection of MATLAB-like commands.
- Object-oriented API: This API is more flexible, and the one to use if you want better control and customization.
I recommend using the object-oriented interface. This ensures that you have a standard syntax irrespective of whether your plot is simple or complex.
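To make the difference concrete, here is a minimal sketch that draws the same line with each API. The data values are arbitrary, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]  # arbitrary illustrative values

# MATLAB-style: pyplot functions act on an implicit "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title("pyplot style")
pyplot_axes = plt.gca()  # the implicit axes that the calls above modified

# Object-oriented style: keep explicit Figure and Axes objects and call methods on them
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("object-oriented style")
```

In the second form every call names the object it changes, which is what keeps the syntax the same as a plot grows to several axes.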
Here are the highlights of my matplotlib reading. It took me about 30 minutes to get familiar with the basic syntax.
- Lifecycle of a plot: A good matplotlib primer using the object-oriented interface. Pay special attention to the definition of Figure and Axes under ‘A note on the Object-Oriented API vs. Pyplot’. I recommend keeping this figure open on the side as reference. After reading this you should feel comfortable with ~80% of the visualizations in this post.
- Get familiar with Artists: Everything in your plot is basically an Artist. Recognizing this is useful when you want to customize the default instances of your objects.
- Get familiar with GridSpec: This will be useful for the scatter-histogram plots, which we will create in Part 2 of this post.
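As a rough illustration of the last two items, the sketch below lays out three axes with GridSpec (the kind of layout a scatter-histogram plot needs) and then checks that the figure, axes, line, axis, and spine are all Artist instances. The grid ratios and data values are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from matplotlib.artist import Artist
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(6, 6))

# 2x2 grid: a large main panel with narrow panels above it and to its right
gs = GridSpec(2, 2, figure=fig, width_ratios=[4, 1], height_ratios=[1, 4])
ax_main = fig.add_subplot(gs[1, 0])
ax_top = fig.add_subplot(gs[0, 0], sharex=ax_main)
ax_right = fig.add_subplot(gs[1, 1], sharey=ax_main)

# plot() returns the Line2D artists it created, so they can be customized directly
(line,) = ax_main.plot([0, 1, 2], [0, 1, 4])
line.set_color("black")
line.set_linewidth(0.8)

# Figure, Axes, Line2D, the x-axis, and each spine are all Artists
parts = (fig, ax_main, line, ax_main.xaxis, ax_main.spines["top"])
is_artist = all(isinstance(obj, Artist) for obj in parts)
```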
The plot we will replicate is found in The Visual Display of Quantitative Information, p.68.
First, let’s modify the default matplotlib font. The font in Tufte’s plot is an oldstyle serif font, one where the numerals don’t line up at the top and the bottom. After some Googling I settled on ‘Sabon Roman OsF’ and downloaded and installed the .ttf version of the font.
To install the font in matplotlib, I deleted the fontList file from matplotlib’s font cache (find the cache directory by running print(matplotlib.get_cachedir())). Next, modify the rcParams to use our new font as the default serif font (you may have to restart your kernel to get this to work).
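A minimal sketch of that configuration step follows. It assumes the font is registered under the family name "Sabon Roman OsF"; the name matplotlib actually sees depends on the font file's metadata, so check it if the font does not take effect. If the font cannot be found, matplotlib falls back to the next serif font in the list.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# The font cache lives in this directory
print(matplotlib.get_cachedir())

# Assumption: the family name under which the installed .ttf is registered
FONT_NAME = "Sabon Roman OsF"

# Use serif fonts by default and try our font first, keeping the other serif fonts as fallbacks
plt.rcParams["font.family"] = "serif"
fallbacks = [f for f in plt.rcParams["font.serif"] if f != FONT_NAME]
plt.rcParams["font.serif"] = [FONT_NAME] + fallbacks
```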
Next, using the original plot as reference, I created some data:
Now let’s initialize and modify the figure and axes:
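The sketch below shows one way this step could look, not necessarily the exact settings used for the plots in this post. It creates the figure and axes with the object-oriented API and removes non-data ink, following the data-ink principle. The figure size, line widths, and tick sizes are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))

# Erase non-data ink: the top and right spines carry no information
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

# Keep the remaining spines, but make them thin
for side in ("left", "bottom"):
    ax.spines[side].set_linewidth(0.5)

# Short, thin ticks with small labels; no background grid
ax.tick_params(axis="both", length=3, width=0.5, labelsize=9)
ax.grid(False)
```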
At this point, here’s how our plot looks:
Time to add some data!
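As a sketch of this step, the snippet below draws a thin line with open markers, limits each axis line to the range of the data, and labels the series on the graphic itself. The years and values are placeholders made up for illustration, not the data from Tufte’s figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Placeholder values for illustration only; NOT the data from Tufte's figure
years = list(range(2000, 2008))
values = [3.0, 3.2, 3.1, 3.5, 3.8, 3.7, 4.1, 4.4]

fig, ax = plt.subplots(figsize=(8, 4))
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

# Thin black line with small open markers
ax.plot(years, values, color="black", linewidth=0.8,
        marker="o", markersize=4, markerfacecolor="white")

# Draw each axis line only over the range of the data (a "range frame")
ax.set_xticks(years)
ax.spines["bottom"].set_bounds(years[0], years[-1])
ax.spines["left"].set_bounds(min(values), max(values))

# Write the explanation on the graphic itself instead of using a legend
ax.annotate("Label for the series", xy=(years[-1], values[-1]),
            xytext=(5, 0), textcoords="offset points", va="center")
```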
Here’s the final plot:
Now that we know our way around Tufte-style line plots, we can replicate a more complicated example from The Visual Display of Quantitative Information, p. 75. The dark line segment between 1955 and 1956 indicates stricter enforcement by Connecticut policemen against cars exceeding the speed limit. Data from other states is provided for comparison.
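One way to get the dark segment is to draw the full series as a thin gray line and then draw the 1955 to 1956 stretch again on top with a thicker black line. The sketch below shows only that technique. The values are placeholders, not the real traffic figures, and the comparison states are left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

years = list(range(1951, 1960))
# Placeholder values for illustration only; NOT the actual figures from the book
series = [12, 11, 11.5, 12.5, 14, 12, 11.5, 10.5, 10]

fig, ax = plt.subplots(figsize=(6, 4))
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

# Full series as a thin gray line
(base,) = ax.plot(years, series, color="0.5", linewidth=0.8)

# Redraw only the 1955-1956 segment, darker and thicker, on top
start = years.index(1955)
(highlight,) = ax.plot(years[start:start + 2], series[start:start + 2],
                       color="black", linewidth=2.5, solid_capstyle="butt")

ax.set_xticks(years[::2])
```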
In the next post we will look at some other Tufte-style plots, including a minimal boxplot and a dot-dash plot.