Tufte in Python – Part One

Edward Tufte recommends several principles for representing data in his book “The Visual Display of Quantitative Information”. Here are some that I find especially useful:

  1. The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented (p. 77)
  2. Write out explanations of the data on the graphic itself. Label important events in the data (p. 77)
  3. Show data variation, not design variation (p. 77)
  4. Maximize the data-ink ratio. Erase non-data ink and redundant data-ink (p. 105)

Inspired by Lukasz Piwek’s implementation of Tufte’s design principles in R, I decided to attempt the same in Python. In this post, I will show you how to create Tufte-style line plots, such as this:

Python has an array of visualization packages (here’s a good overview). My go-to package until now has been plotnine, which provides a ggplot2 interface in Python. It’s elegant and has allowed me to avoid learning Python’s infamously annoying package, matplotlib. However, for all the flak it gets, matplotlib is the most customizable and powerful visualization package Python offers. It also serves as the base for most other Python visualization tools (including plotnine). So, here’s a guide to implementing Tufte’s principles in Python using matplotlib.

A very short introduction to Matplotlib

A confusing aspect of matplotlib is the existence of two APIs, which is possibly one of its biggest flaws. It makes troubleshooting bugs difficult, since answers on StackOverflow frequently jump between the two APIs:

  1. MATLAB-style API: Matplotlib was originally written to mimic MATLAB, and the pyplot (plt) interface provides a collection of MATLAB-like commands.
  2. Object-oriented API: This API is more flexible, and is the one to use if you want better control and customization.

I recommend using the object-oriented interface. This ensures standard syntax irrespective of whether your plot is simple or complex.
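To see the difference, here is a minimal sketch of the same plot written both ways (illustrative data):

```python
import matplotlib.pyplot as plt

x, y = [1, 2, 3], [2, 4, 8]

# MATLAB-style API: pyplot tracks a hidden "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title('pyplot style')

# Object-oriented API: you hold explicit Figure and Axes objects
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('object-oriented style')

plt.show()
```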

Here are the highlights of my matplotlib reading. It took me about 30 min to get familiar with the basic syntax.

  1. Lifecycle of a plot: A good matplotlib primer using the object-oriented interface. Pay special attention to the definitions of Figure and Axes under ‘A note on the Object-Oriented API vs. Pyplot’. I recommend keeping this figure open on the side as a reference. After reading this you should feel comfortable with ~80% of the visualizations in this post.
  2. Get familiar with Artists: Everything in your plot is basically an Artist. Recognizing this is useful when you want to customize the default instances of your objects.
  3. Get familiar with GridSpec: This will be useful for the scatter-histogram plots, which we will create in Part 2 of this post.

Minimal line plot

The plot we will replicate is found in The Visual Display of Quantitative Information, p.68.

First, let’s modify the default matplotlib font. The font in Tufte’s plot is an oldstyle serif font, one where the numerals don’t line up at the top and the bottom. After some Googling I settled on ‘Sabon Roman OsF’ and downloaded and installed the .ttf version of the font.

To install the font in matplotlib, I deleted the fontList file from matplotlib’s font cache (find the cache by running print(matplotlib.get_cachedir())). Next, modify the rcParams to use the new font as the default serif font (you may have to restart your kernel for this to take effect).
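A sketch of the rcParams change (the font name matches the Sabon install above; substitute whichever oldstyle serif you use):

```python
import matplotlib as mpl

# Make serif the default family and put the new font first in line
mpl.rcParams['font.family'] = 'serif'
mpl.rcParams['font.serif'] = ['Sabon Roman OsF']
```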

Next, using the original plot as reference, I created some data:
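Something like the following, where the numbers are eyeballed from the original plot and should be treated as illustrative stand-ins rather than the exact series:

```python
import numpy as np

# Eleven yearly values read approximately off Tufte's plot (illustrative)
year = np.arange(1967, 1978)
value = np.array([310, 330, 370, 385, 385, 393, 387, 380, 390, 400, 380])
```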

Now let’s initialize and modify the figure and axes:
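A sketch of the Tufte-style setup using the Axes spine methods; the bounds below refer to the hypothetical `year` and `value` arrays from the previous snippet:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))

# Erase non-data ink: drop the top and right spines entirely
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Trim the remaining spines so they span only the data range
ax.spines['left'].set_bounds(value.min(), value.max())
ax.spines['bottom'].set_bounds(year.min(), year.max())

# Ticks point outward, away from the data region
ax.tick_params(direction='out')
ax.set_xticks(year)
```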

At this point, here’s how our plot looks:

Time to add some data!
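Plotting is a single call on the same Axes (the title text is a placeholder):

```python
# A thin black line keeps the data-ink ratio high
ax.plot(year, value, color='black', linewidth=1.0)
ax.set_title('Tufte-style minimal line plot')  # placeholder title
fig.savefig('minimal_lineplot.png', dpi=300, bbox_inches='tight')
```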

Here’s the final plot:

Now that we know our way around Tufte-style line plots, we can replicate a more complicated example from The Visual Display of Quantitative Information, p. 75. The dark line segment between 1955 and 1956 marks stricter enforcement by Connecticut police against drivers exceeding the speed limit. Data from other states is provided for comparison.
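The key tricks are overplotting one heavier segment and labeling each series directly on the graphic instead of in a legend. A sketch with hypothetical values (the real figures are in Tufte, p. 75):

```python
import matplotlib.pyplot as plt

# Hypothetical traffic-fatality rates for Connecticut (illustrative)
years = list(range(1951, 1960))
connecticut = [13.5, 13.0, 12.8, 13.2, 14.0, 12.2, 11.5, 11.8, 11.0]

fig, ax = plt.subplots(figsize=(6, 8))
ax.plot(years, connecticut, color='black', linewidth=1.0)

# Emphasize the 1955-1956 crackdown with a heavier overplotted segment
ax.plot([1955, 1956], [14.0, 12.2], color='black', linewidth=3.0)

# Label the series on the graphic itself rather than in a legend
ax.text(years[-1] + 0.2, connecticut[-1], 'Connecticut', va='center')
```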


Final figure:

In the next post we will look at some other Tufte-style plots, including a minimal boxplot and a dot-dash plot.

EA Forum: Data Analysis and Deep Learning

Here’s a fun project I undertook this month:

  1. Scrape all posts from the Effective Altruism (EA) Forum
  2. Explore overall trends in the data, e.g. posts with the most comments, authors with the most posts, etc.
  3. Build a wordcloud to visualize the most used words
  4. Fine-tune GPT2 on the EA Forum text corpus and generate text. Here’s a preview of the text GPT2 produced:

>GITC’s Vaccination Prevention Research Project This is the first post of a three part series on the development of effective vaccines. This series will start with a list of possible vaccines that can be developed by the GPI team

Code and data for this project are available at this GitHub repo.

1. Scraping

The robots.txt file of EA Forum disallows crawling/scraping data from forum.effectivealtruism.org/allPosts. To get around this, I did the following:

  • Manually loaded yearly links from /allPosts (this required manually clicking each year followed by "Load More")
  • Used a link extractor in Chrome to extract links from the page into a .csv file
  • Used Scrapy to scrape the following fields from each link: ‘date’, ‘author’, ‘title’, ‘number of comments’, ‘number of karma’, and ‘content’. I extracted data for posts published between 01-01-2013 and 05-04-2020. Posts with low karma (below -10) were ignored.
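A sketch of the spider; the CSS selectors and file name below are assumptions and would need adjusting to the forum’s actual markup:

```python
import scrapy

class EAPostSpider(scrapy.Spider):
    """Scrapes one record per post URL listed in post_links.csv
    (the CSV exported from the Chrome link extractor)."""
    name = 'ea_posts'

    def start_requests(self):
        with open('post_links.csv') as f:  # hypothetical file name
            for url in (line.strip() for line in f if line.strip()):
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Selectors are illustrative placeholders, not the forum's real classes
        yield {
            'date': response.css('time::attr(datetime)').get(),
            'author': response.css('a.author::text').get(),
            'title': response.css('h1::text').get(),
            'comments': response.css('span.comment-count::text').get(),
            'karma': response.css('span.karma::text').get(),
            'content': ' '.join(response.css('div.post-body ::text').getall()),
        }
```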

I cleaned the data and restricted subsequent analyses to posts published between 01-01-2013 and 04-15-2020, since more recent posts were unlikely to have accumulated comments.

2. Exploratory Data Analysis

2.1 Number of yearly posts

2.2 Posts with the most comments

2.3 Posts with the most karma

2.4 Authors with the most posts

author          num_posts
Aaron Gertler   87
Milan_Griffes   83
Peter_Hurford   74
RyanCarey       66
Tom_Ash         58

2.5 Authors with highest mean post karma 

(Authors with fewer than 2 posts were excluded.)

author            mean_post_karma
Buck              92.2
Jonas Vollmer     77.0
Luisa_Rodriguez   74.7
saulius           73.5
sbehmer           73.0

3. Word Clouds

My next goal was to make a word cloud representing the most commonly used words in the EA Forum. I preprocessed the post content as follows (a sketch of the pipeline appears after the list):

  • Tokenized words
  • Expanded word contractions e.g. ‘don’t’ -> ‘do not’
  • Converted all words to lowercase
  • Removed tokens that were only punctuation
  • Filtered out stop words using nltk
  • Removed any tokens containing numbers
  • Removed any tokens containing ‘http’ or ‘www’
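A sketch of that pipeline, assuming the third-party contractions package alongside nltk (the exact ordering of steps is a judgment call; this version expands contractions before tokenizing):

```python
import re
import string
import contractions  # pip install contractions
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

STOP = set(stopwords.words('english'))
PUNCT = set(string.punctuation)

def preprocess(text):
    # Expand contractions ("don't" -> "do not"), lowercase, then tokenize
    tokens = word_tokenize(contractions.fix(text).lower())
    return [
        t for t in tokens
        if not all(c in PUNCT for c in t)       # drop pure-punctuation tokens
        and t not in STOP                       # drop stop words
        and not re.search(r'\d', t)             # drop tokens containing numbers
        and 'http' not in t and 'www' not in t  # drop link fragments
    ]
```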

The resulting word cloud was built using the Python word_cloud package on ~2.6 million tokens:
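The generation step itself is short; a sketch, where tokens is the flat list produced by preprocess above and the figure dimensions are arbitrary:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# WordCloud takes a single string; collocations=False avoids paired-word tokens
wc = WordCloud(width=1200, height=800, background_color='white',
               collocations=False).generate(' '.join(tokens))

fig, ax = plt.subplots(figsize=(12, 8))
ax.imshow(wc, interpolation='bilinear')
ax.axis('off')
```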

The most common words appeared to be ‘one’ and ‘work’. I thought it would be instructive to see whether these were over-represented in the EA Forum specifically, or simply common across blogs and forums in general. To generate a control, I scraped all posts from Slate Star Codex (SSC) and performed identical text preprocessing to generate ~1.4 million tokens.

Using R’s wordcloud package, I built a “comparative” word cloud showing words over-represented in the EA Forum versus SSC and vice versa.

What about words that were common between the EA Forum and SSC?

4. GPT2

Finally, I used the text corpus from the EA Forum to fine-tune GPT2 (see http://wiki.fast.ai/index.php/Fine_tuning). GPT2 is a text-generating language model trained on 8 million web pages. You can play around with it on Talk to Transformer, where the model completes a sentence given a prompt.

For fine-tuning I used this very convenient template Colab notebook made with gpt2-simple. I fine-tuned both the "small" (124M parameters) and "medium" (355M parameters) GPT2 models and tried a few temperature settings for text generation.
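The core calls in gpt2-simple look roughly like this; the corpus file name and step count are assumptions:

```python
import gpt_2_simple as gpt2

# Download the pretrained "medium" model, then fine-tune on the forum corpus
gpt2.download_gpt2(model_name='355M')

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset='ea_forum_posts.txt',  # hypothetical corpus file
              model_name='355M',
              steps=1000)

# Lower temperature -> more conservative text; higher -> more surprising
text = gpt2.generate(sess, temperature=0.7, return_as_list=True)[0]
print(text)
```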

GPT2 can work with or without a prompt, and typically produced "good" text in about 1 of every 25 tries. I’ve cherry-picked some outputs I obtained from the fine-tuned models:

  • Without a prompt:

    > What is Rationality in AI? Background Rationality is an ideology of AI, a belief system based on the belief that humans are naturally sentient. It argues that all human actions are responsible for the wellbeing of all sentient beings. This view is based on a large body of evidence. It is not a new, classicist theory of AI. Instead it is a mixture of old and new philosophy.

    > TL;DR’s for the EA Forum/Welcome: ”Effective altruists are trying to figure out how to build a more effective AI, using paperclips, but we’re not really sure how it’s possible to do so.

    > GITC’s Vaccination Prevention Research Project This is the first post of a three part series on the development of effective vaccines. This series will start with a list of possible vaccines that can be developed by the GPI team, ending with a brief overview of the science behind vaccine development. We will then address a variety of questions in the areas of biosecurity, biosecurity technologies and improved vaccine safety.This is a work in progress and we hope to get back to you soon.

  • With a prompt – ‘Introduction to effective altruism’:

    > Introduction to effective altruism [ edit ] The most basic assumption about effective altruism is that we must act rationally. It follows from this that you should maximize the benefit of the action you’re taking. The absolute value of a given intervention is the number of lives saved or improved. If the total benefit of the intervention is too small, you should not take it.

    > Introduction to effective altruism as a means to furthering one’s life, many people I have spoken with who are currently living on less than $10,000 per year have completed some EA-related training. The basic idea of effective altruism is that one should donate your money to the most effective charities. In practice, many people who are involved with EA don’t do this, and they will only use other methods to do it.