Plot or Not? Voting Results

Screenshot from plotornot


During the SciPy conference last June, Adrian Price-Whelan and I got to commiserating about the ugly default styles of Matplotlib plots, something that also came up at the Matplotlib town-hall meeting. Because Python’s main plotting library was designed as an alternative to Matlab, it inherited much of the appearance and API from that language. This was originally an asset for Matplotlib, as it provided an obvious path for users to migrate away from Matlab. Unfortunately, as Matplotlib has matured, its core visual and programmatic design has calcified. People who want to create nice plots in Python are forced to write lots of tweaking code.

Users who push for changes to the default appearance of plots usually face 3 challenges:

– While most agree that Matplotlib would benefit from a better style, there is less consensus on what a replacement style should look like.

– Some are skeptical about the subjectivity of plot aesthetics, and think that “improvements” to plot styles are really just pleasing some users at the cost of others.

– Matplotlib developers want to avoid changes which might “break” the appearance of legacy user code running in pipelines.

As Adrian and I discussed this, we wondered what it would take to integrate substantial stylistic changes into Matplotlib itself. We realized there’s very little data on what kind of Matplotlib plots people actually like. So we decided to collect some. During the SciPy sprints, we put together Plot or Not?, which randomly showed visitors the same Matplotlib plot rendered with two different styles, and asked which one they preferred. People liked it: the site crashed (I apparently don’t know how to make websites with actual traffic), we acquired about 14,000 votes, somebody suggested the name was misogynous, we triggered a good discussion on the Matplotlib developer mailing list, and we promised to share the voting results soon. Then I remembered I had to finish my thesis, and the data sat on a server somewhere for 6 months.

As luck would have it, I found some time to dig into the votes this weekend. You can explore the results at this page, which shows a scatter plot of each Plot or Not image as a function of the fraction of votes it received (X axis) and the margin of victory/defeat (Y axis). Clicking on a point will show you the voting breakdown for a given face-off.

The dataset, it turns out, has a lot of interesting information about what kinds of plots people like. You should explore for yourself, but here are some of the biggest themes I noticed:

People largely share the same aesthetic preferences

Yes, aesthetics have a large subjective component. However, most people agreed on which plots they preferred. This argues that there are stylistic changes one could make to Matplotlib that would be a net improvement, despite subjectivity.

Legibility is the most important factor

If you look at the heaviest favorites, many of them are comparisons between a plot with easily-seen lines and one whose lines are too thin, too transparent, or too light.

Cam Davidson-Pilon’s Style is Very Good

For some of the plots we generated, we used the settings Cam Davidson-Pilon used in his online book. These were consistently selected as the favorite, and often by large victories like 5:1 or more.

We also used the style from Huy Nguyen’s blog post, which emulates GGPlot — it’s very similar, though it uses a thinner font and linewidth. People slightly preferred Cam’s style in head-to-head comparisons — probably because the lines are easier to see.

People like the dark Color Brewer colors (but not the pastel ones)

Many of the plots in plotornot used colors from Color Brewer. People liked line plots and histograms that used the Set1 and Dark2 color palettes. Likewise, people often preferred filled contour plots that used the divergent Color Brewer palettes.

However, people did not like plots drawn with pastel Color Brewer tables (Pastel, Accent, Paired2, Paired3). These are both harder to see, and feel a bit… “Easter-y” (this is a highly scientific adjective). Unpopular colors for contour plots included Accent, Prism, HSV, and gist_stern. All of these palettes cycle through several hues. It is hard to encode scale with hue, and people preferred palettes restricted to one or two hues. In fairness, some of these multihue palettes would have looked better on images that encode more than ~5 values at a time. Still, the advice from visualization experts seems to be to stick to one- or two-hue colormaps. The latter are best suited to cases where you want to call attention to outliers with both large and small values.

The default Matplotlib colors are almost never preferred

Unlike the Color Brewer colors — which are designed for legibility and coherence — the default Matplotlib color set is pretty arbitrary (blue, green, red, cyan, magenta, yellow, black). These colors don’t work well together, and it shows in the votes. In the few instances where a matplotlib default was preferred, the other plot usually had hard-to-see lines.

An easy improvement

There are a lot of ways one could consider changing styles in matplotlib. The votes from Plot Or Not? suggest a few obvious improvements:

– Use the Set1 or Dark2 Color Brewer palettes for the default line style

– Use a single-hue colormap like ‘gray’ for the default color map.

– Increase the default linewidth from 1 to 2

While the Matplotlib devs are still resistant to changing any defaults, there are some improvements that you will start to see in Matplotlib v1.4. This includes a “style.use” function which will let you easily select style sheets by name or filepath/url. For example, to use the style changes advocated for in this blog post, you could write

from matplotlib.style import use
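As a rough illustration of the three suggestions listed above (not the stylesheet this post refers to), the same settings can be passed to `style.use` as a plain dictionary. This sketch assumes a Matplotlib version recent enough to support the `axes.prop_cycle` setting, which is newer than v1.4; the dictionary name is made up for the example, and the hex values are the nine Set1 Color Brewer colors.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

from cycler import cycler
from matplotlib import style

# The nine colors of the Color Brewer "Set1" qualitative palette.
SET1 = [
    "#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00",
    "#ffff33", "#a65628", "#f781bf", "#999999",
]

# Hypothetical name; these are only the three changes suggested by the votes.
SUGGESTED_STYLE = {
    "axes.prop_cycle": cycler(color=SET1),  # Set1 for the default line colors
    "image.cmap": "gray",                   # single-hue default colormap
    "lines.linewidth": 2,                   # thicker default lines
}

# style.use accepts a style name, a path or URL, or a dict of rc settings.
style.use(SUGGESTED_STYLE)
```

Any plot created after the `style.use` call picks up these defaults; existing scripts that never call it are unaffected.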

My hope is that Matplotlib will start to build some nice stylesheets that ship with the library. Eventually, I would also love to see a new option for my matplotlibrc file that specifies ‘default_style: latest’ — this would indicate that I am “opting-in” to whatever the Matplotlib developers deem to be the best default style. This style could then incrementally improve with each release, without breaking any legacy code.

In the meantime, the 6 months since SciPy have seen a lot of progress on viz libraries which build on top of (seaborn, prettyplotlib, a ggplot clone, mpld3, glue) or offer alternatives to (vincent, bokeh) Matplotlib. I’m excited about all of these projects, but hope also that Matplotlib is able to keep evolving to stay modern (I haven’t talked at all about Matplotlib’s API, but I would love to see that improve as well). Matplotlib has solved a lot of problems and remains the most mature library for plotting in Python by far. Even incremental improvements to Matplotlib can have a big effect on the Python community.

Forecasting the Oscars Like a Boss

For the past eight-ish years, I’ve participated in an Oscar prediction pool with my brothers and some of my friends. We’ve fooled around with a number of different scoring schemes. The version we’ve settled on the past few years has been to sort each prediction by how confident we are, and assign point values to each category accordingly. This seems to keep the contest more interesting longer through the awards broadcast.

These contests have typically resulted in my shame. It turns out that I am terrible at guessing which movies are likely to win Oscars (never mind the fact that I have NO FREAKING IDEA what the difference between sound editing and sound mixing is, but apparently they each need their own categories. Whatever.). My brother Jon has won handily the past 3 years, and that has to stop.

This year, I resolved to systematize my predictions, by building a model to forecast Oscar results. Why do this, you ask?

1) The lucrative $50 prize, and all of the opportunities that that opens up.

2) I’ve been looking for a non-astrophysics data analysis project, and this seemed fun. Also, astrophysicists throw garbage at you when you try to do inference without a physically-justified model, and I wanted to slum it up a bit with some purely-empirical forecasting.

3) I’ve been wanting to spend more time with the Pandas and Scikit-Learn Python libraries.

The Data

The Academy Awards is the last ceremony in a long awards season. The most obvious data to use to try to forecast the Oscars are the results from these previous ceremonies. This is especially true since the same people vote for multiple ceremonies, so these awards can be seen as polls for the Oscars. There are other potentially interesting variables to consider (Rotten Tomatoes rating, box office performance, genre or cast information, etc.), but I decided to start with the ceremony results.

The IMDB archives the nominees and winners for the major ceremonies. They don’t provide a nice API for grabbing their data, but it’s easy enough to parse from the HTML (enter: BeautifulSoup). As is so often the case, data cleaning is among the most time-consuming tasks of the project. For example, many award categories change names slightly over time (Best Picture -> Best Motion Picture of the Year). After standardizing all of this information, I ended up with a JSON database of all the nominees and winners for 7 awards ceremonies since 1990 (about 8800 nominations in total).


Before doing anything fancy, I wanted to get a grasp on what the data looked like. To be concrete, I’ll focus on the Best Picture category for the moment. Here’s a plot of what fraction of movies go on to win the Best Picture Oscar, as a function of whether they were nominated for or won the award in a different ceremony.

Correlation between best picture nominees/winners for the Oscars and other ceremonies


Interestingly (though not all that surprising in retrospect), winning the Independent Spirit Award (which focuses on indie cinema) is anti-correlated with winning the Oscar. The only movie since 1990 to win both the Independent Spirit and Oscar Best Picture awards was The Artist last year. Also interesting is the fact that the Golden Globes correlates so weakly with the Oscars — this ceremony is often touted as being a good predictor for the Academy Awards, but other ceremonies clearly do better.
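The fractions behind a plot like this are simple conditional tallies. Here is a minimal sketch in plain Python; the record layout (`ceremony`, `outcome`, `won_oscar`) is a hypothetical one chosen for the example, not the structure of the JSON database described above, and the toy records are invented.

```python
from collections import defaultdict


def oscar_win_fraction(records):
    """Fraction of films that won the Oscar, per (ceremony, outcome) pair.

    `records` is an iterable of dicts with the hypothetical keys
    'ceremony', 'outcome' ('nominated' or 'won') and 'won_oscar' (bool).
    """
    tallies = defaultdict(lambda: [0, 0])  # key -> [oscar wins, total films]
    for rec in records:
        key = (rec["ceremony"], rec["outcome"])
        tallies[key][0] += int(rec["won_oscar"])
        tallies[key][1] += 1
    return {key: wins / total for key, (wins, total) in tallies.items()}


# Invented toy records, only to show the shape of the input.
toy_records = [
    {"ceremony": "Golden Globes", "outcome": "won", "won_oscar": True},
    {"ceremony": "Golden Globes", "outcome": "won", "won_oscar": False},
    {"ceremony": "Golden Globes", "outcome": "nominated", "won_oscar": False},
]
fractions = oscar_win_fraction(toy_records)
```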


How do I combine all of this information into an estimate of who is most likely to win the 2013 Best Picture award? Equally important, how do I assess how confident this prediction is, so that I can wager more points on the categories that are most certain?

One of the standard strategies for estimating success probabilities based on a 0/1 outcome (a movie either wins, or it doesn’t) is logistic regression. It’s pretty much the simplest thing to try, so it’s worth checking how well it does.

There are two wrinkles to this model:

  1. It doesn’t account for the fact that exactly one nominee in each category wins. I address this by re-normalizing the probabilities in the model within each category, and adjusting the likelihood calculation of the data given the model accordingly. This makes finding the optimal model slightly harder, but the dataset is small enough that computing power isn’t much of a concern.
  2. With only 23 years of historical data, over-fitting is a danger. To address this, I ran a regularized regression. That is, instead of fitting the model by maximizing the likelihood, I maximize a modified likelihood that penalizes Logistic Regression models with large coefficients. The size of coefficients in a Logistic Regression model directly relates to how confident predictions are, so the penalty acts to make the model more conservative, and less likely to draw too strong a conclusion from a small training dataset. The strength of the penalty is chosen by cross-validation.
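To make the two wrinkles concrete, here is a minimal sketch of the objective in plain Python. It is an illustration under assumptions, not the code used for this post: the function names, the two features (say, "won ceremony A" and "won ceremony B"), the squared-coefficient form of the penalty, and the toy numbers are all invented. The optimizer and the cross-validation loop are left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(weights, intercept, nominees):
    """Logistic probability per nominee, rescaled so the category sums to 1."""
    raw = [
        sigmoid(intercept + sum(w * x for w, x in zip(weights, features)))
        for features in nominees
    ]
    total = sum(raw)
    return [p / total for p in raw]


def penalized_neg_log_likelihood(weights, intercept, categories, penalty):
    """Quantity to minimize: -log likelihood plus a penalty on large weights.

    `categories` is a list of (nominee_feature_rows, index_of_winner) pairs,
    one pair per year.
    """
    nll = 0.0
    for nominees, winner in categories:
        probs = category_probabilities(weights, intercept, nominees)
        nll -= math.log(probs[winner])
    return nll + penalty * sum(w * w for w in weights)


# Invented toy data: two years, three nominees each, two 0/1 features.
toy_categories = [
    ([[1, 1], [0, 0], [0, 0]], 0),
    ([[1, 0], [0, 1], [0, 0]], 1),
]
probs = category_probabilities([0.5, 2.0], -1.0, toy_categories[0][0])
```

Minimizing this function over the weights for a fixed `penalty`, then picking the `penalty` that predicts held-out years best, is the kind of procedure the list above describes.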

The scikit-learn library is really wonderful for this kind of work. First, they provide a lot of functionality out-of-the-box (optimization, cross validation, and implementations of dozens of models). Furthermore, the API is extremely consistent, so that you can build your own custom classifiers, using scikit-learn objects as building blocks. I definitely plan on using it more, even for more vanilla model fitting and optimization tasks (its API is way better than most of SciPy in my opinion).

There are a number of criteria to evaluate whether this model is a good fit to the data.

How accurate is it?

This model correctly classifies about 75% of the best picture winners since 1990. Furthermore, the years it fails usually correspond to notable upsets; for example, Crash unexpectedly won Best Picture in 2006, despite Brokeback Mountain being a strong favorite. Brokeback Mountain won best picture in every other ceremony in the database, and Crash is the only movie that won the Oscar without even being nominated for the Golden Globe. Other notable upsets include 1999 (when Shakespeare in Love beat Saving Private Ryan) and 1996 (Braveheart won, the model predicted Apollo 13). I see these upsets as indicative of the inherent uncertainty in trying to predict Oscar winners based on other ceremonies.

How representative is the data?

One nice property about the model is that it is generative — you can use it to simulate hypothetical outcomes for each year, based on the information provided from the other ceremonies. If the model is a fair representation of the data, then the actual outcome should look like these simulated outcomes (if it doesn’t, this suggests an over-fit or mis-specified model).

Monte-Carlo Simulation from the Best Picture Forecast Model


The left plot shows, for each of 1000 hypothetical results of 23 years of best picture awards, how many years were correctly predicted by the model. The black line is the actual data, and falls near the typical value of this distribution. That’s encouraging.

The right plot is slightly more discerning. It plots, as a function of the confidence at which each prediction is made, how many mistakes were made which had confidences less than this threshold. If these confidences are right, then the model should make most of its mistakes at lower confidences; that is, the curve should rise steeply on the left and then level off (I used a similar plot when talking about Nate Silver’s election forecast results). The black lines show 500 simulations, and the red line shows the actual data. Again, it’s encouraging that it runs through the collection of black curves.
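The simulation behind the left plot can be sketched in a few lines of standard-library Python. This is an assumed reconstruction of the idea, not the original code: each year is represented only by the model's win probabilities for its nominees, a winner is drawn at random from those probabilities, and the draw is compared with the model's favorite. The example probabilities are invented.

```python
import random


def simulate_correct_counts(yearly_probs, n_sims=1000, seed=0):
    """For each simulated history, count years where the favorite won.

    `yearly_probs` holds one list of win probabilities per year.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        correct = 0
        for probs in yearly_probs:
            indices = range(len(probs))
            favorite = max(indices, key=probs.__getitem__)
            # Draw a hypothetical winner according to the model's probabilities.
            winner = rng.choices(indices, weights=probs)[0]
            correct += winner == favorite
        counts.append(correct)
    return counts


# Invented probabilities for three years of five nominees each.
toy_years = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.40, 0.35, 0.15, 0.05, 0.05],
    [0.55, 0.25, 0.10, 0.05, 0.05],
]
counts = simulate_correct_counts(toy_years)
```

A histogram of `counts`, with a vertical line at the number of years the model actually got right, gives the kind of comparison described for the left panel.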

This fairly simple model seems to do a pretty good job at characterizing the outcome of the Best Picture category for the past 2 decades. It’s possible that a better model (or additional information about each nominee) could make the predictions more precise (i.e. skew the blue histogram to the right, and make it narrower). I may look into this in the coming weeks. In any event, I’ll be able to make predictions about the 2013 Oscars once the other award ceremonies happen. Look for more blog posts.

I’m coming for you, Jon.

The hardest words to text

I have an old cell phone. It looks like this:

I didn't take this picture with my phone. It is both too old and insufficiently bendy to photograph itself

If you have a phone like this, you know that texting gets annoying — if you want to type “Hi”, you need to press the 4 button twice, wait for the cursor to move over a spot (or hit the right arrow button), and then hit 4 three more times. In other words, the sequence of button presses to type “hi”  — what I’ll call the keypress sequence — is ‘44.444’, where ‘.’ is either a delay or the right arrow key. That’s 6 presses — 3 times the number of letters — to type one of the most common words in the English language. That’s not so good.

Ruminations like this made me wonder what the best and worst words to type on a keyboard like this are. A bad word would require many more button presses than letters, either because lots of the letters are the 3rd or 4th on their key (like S), or because several consecutive letters appear on the same key, requiring lots of right arrow presses (like the word “moon”, whose keypress sequence is “6.666.666.66”).
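The keypress sequence is easy to compute. Here is a small sketch, assuming the standard letter layout on keys 2 through 9 and using '.' for the pause or right-arrow press, as above; the function and variable names are made up for the example.

```python
# Standard multi-tap layout: letters printed on keys 2-9.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Map each letter to its presses, e.g. 'h' -> '44', 's' -> '7777'.
PRESSES = {
    letter: key * (position + 1)
    for key, letters in KEYPAD.items()
    for position, letter in enumerate(letters)
}


def keypress_sequence(word):
    """Return the button presses for `word`, with '.' between same-key letters."""
    parts = []
    for letter in word.lower():
        presses = PRESSES[letter]
        # Two letters in a row on the same key need a pause or right arrow.
        if parts and parts[-1][0] == presses[0]:
            parts.append(".")
        parts.append(presses)
    return "".join(parts)


def presses_per_letter(word):
    """How many button presses each letter costs on average."""
    return len(keypress_sequence(word)) / len(word)
```

With this, `keypress_sequence("hi")` gives `'44.444'` and `keypress_sequence("moon")` gives `'6.666.666.66'`, matching the examples in the text.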


Playing Hangman

Remember hangman? Here are the rules: Player A chooses a secret word (say “hollow”) and draws underscores for each letter in that word (“_ _ _ _ _ _”). Player B then guesses a letter. If that letter is in the word (“L”), player A reveals the locations of those letters (“_ _ L L _ _”). If that letter is not in the word (“R”), then player B gets a strike. If player B gets a certain number of strikes (perhaps 6 or 8) before he guesses the word, he loses. Otherwise he wins.
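These rules fit in a few lines of Python. The sketch below only referees a game from a fixed list of guesses; it does not choose guesses, and the function names and the default of 6 strikes are picked for the example.

```python
def reveal(secret, guessed):
    """Show guessed letters in place and underscores everywhere else."""
    return " ".join(ch.upper() if ch in guessed else "_" for ch in secret.lower())


def play(secret, guesses, max_strikes=6):
    """Return True if the guesses uncover the word before the strikes run out."""
    secret = secret.lower()
    guessed = set()
    strikes = 0
    for letter in guesses:
        letter = letter.lower()
        guessed.add(letter)
        if letter not in secret:
            strikes += 1  # wrong guess: player B takes a strike
            if strikes >= max_strikes:
                return False
        if all(ch in guessed for ch in secret):
            return True
    return False  # ran out of guesses without finishing the word
```

For the example above, `reveal("hollow", {"l"})` returns `'_ _ L L _ _'`.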

If you are the guesser, what is your strategy for guessing? Or, more nerdily, how would you program a computer to guess at hangman?
