Plot or Not? Voting Results

Screenshot from plotornot


During the SciPy conference last June, Adrian Price-Whelan and I got to commiserating about the ugly default styles of Matplotlib plots, something that also came up at the Matplotlib town-hall meeting. Because Python’s main plotting library was designed as an alternative to Matlab, it inherited much of its appearance and API from that language. This was originally an asset for Matplotlib, as it provided an obvious path for users to migrate away from Matlab. Unfortunately, as Matplotlib has matured, its core visual and programmatic design has calcified. People who want to create nice plots in Python are forced to write lots of tweaking code.

Users who push for changes to the default appearance of plots usually face 3 challenges:

– While most agree that Matplotlib would benefit from a better style, there is less consensus on what a replacement style should look like.

– Some are skeptical about the subjectivity of plot aesthetics, and think that “improvements” to plot styles are really just pleasing some users at the cost of others.

– Matplotlib developers want to avoid changes which might “break” the appearance of legacy user code running in pipelines.

As Adrian and I discussed this, we wondered what it would take to integrate substantial stylistic changes into Matplotlib itself. We realized there’s very little data on what kind of Matplotlib plots people actually like. So we decided to collect some. During the SciPy sprints, we put together Plot or Not?, which randomly showed visitors the same Matplotlib plot rendered with two different styles, and asked which one they preferred. People liked it: the site crashed (I apparently don’t know how to make websites with actual traffic), we acquired about 14,000 votes, somebody suggested the name was misogynous, we triggered a good discussion on the Matplotlib developer mailing list, and we promised to share the voting results soon. Then I remembered I had to finish my thesis, and the data sat on a server somewhere for 6 months.

As luck would have it, I found some time to dig into the votes this weekend. You can explore the results at this page, which shows a scatter plot of each Plot or Not image as a function of the fraction of votes it received (X axis) and the margin of victory/defeat (Y axis). Clicking on a point will show you the voting breakdown for a given face-off.

The dataset, it turns out, has a lot of interesting information about what kinds of plots people like. You should explore for yourself, but here are some of the biggest themes I noticed:

People largely share the same aesthetic preferences

Yes, aesthetics have a large subjective component. However, most people agreed on which plots they preferred. This suggests that there are stylistic changes one could make to Matplotlib that would be a net improvement, despite the subjectivity.

Legibility is the most important factor

If you look at the heaviest favorites, many of them are comparisons between a plot with easily-seen lines and one whose lines are too thin, too transparent, or too light.

Cam Davidson-Pilon’s Style is Very Good

For some of the plots we generated, we used the settings Cam Davidson-Pilon used in his online book. These were consistently selected as the favorite, often by margins of 5:1 or more.

We also used the style from Huy Nguyen’s blog post, which emulates GGPlot — it’s very similar, though it uses a thinner font and linewidth. People slightly preferred Cam’s style in head-to-head comparisons — probably because the lines are easier to see.

People like the dark Color Brewer colors (but not the pastel ones)

Many of the plots in plotornot used colors from Color Brewer. People liked line plots and histograms that used the Set1 and Dark2 color palettes. Likewise, people often preferred filled contour plots that used the divergent Color Brewer palettes.

However, people did not like plots drawn with pastel Color Brewer tables (Pastel, Accent, Paired2, Paired3). These are both harder to see, and feel a bit… “Easter-y” (this is a highly scientific adjective). Unpopular colors for contour plots included Accent, Prism, HSV, and gist_stern. All of these palettes cycle through several hues. It is hard to encode scale with Hue, and people preferred palettes restricted to one or two hues. In fairness, some of these multihue palettes would have looked better on images that encode more than ~5 values at a time. Still, the advice from visualization experts seems to be to stick to one- or two-hue colormaps. The latter are best suited in cases where you want to call attention to outliers with both large and small values.

The default Matplotlib colors are almost never preferred

Unlike the Color Brewer colors — which are designed for legibility and coherence — the default Matplotlib color set is pretty arbitrary (blue, green, red, cyan, magenta, yellow, black). These colors don’t work well together, and it shows in the votes. In the few instances where a matplotlib default was preferred, the other plot usually had hard-to-see lines.

An easy improvement

There are a lot of ways one could consider changing styles in matplotlib. The votes from Plot Or Not? suggest a few obvious improvements:

– Use the Set1 or Dark2 Color Brewer palettes for the default line style

– Use a single-hue colormap like ‘gray’ for the default color map.

– Increase the default linewidth from 1 to 2

While the Matplotlib devs are still resistant to changing any defaults, there are some improvements that you will start to see in Matplotlib v1.4. This includes a “style.use” function which will let you easily select style sheets by name or filepath/url. For example, to use the style changes advocated for in this blog post, you could write

from matplotlib.style import use
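As a rough sketch of what such a style sheet could contain, the snippet below assembles the three suggestions above into matplotlibrc-style text using only the Python standard library. The helper name `build_stylesheet`, the choice of the first five Set1 colors, and the `axes.prop_cycle` key are assumptions made for illustration (older releases such as v1.4 spelled the color setting `axes.color_cycle`), so this is not necessarily the file used for this post.

```python
# First five colors of the Color Brewer Set1 palette (hex, without '#',
# because '#' starts a comment in matplotlibrc-style files).
SET1 = ["E41A1C", "377EB8", "4DAF4A", "984EA3", "FF7F00"]


def build_stylesheet(colors=SET1, linewidth=2, cmap="gray"):
    """Return style-sheet text for the three suggested defaults."""
    cycle = ", ".join(repr(c) for c in colors)
    settings = [
        ("lines.linewidth", str(linewidth)),
        ("image.cmap", cmap),
        # Assumption: the cycler-based key used by newer Matplotlib releases.
        ("axes.prop_cycle", "cycler('color', [" + cycle + "])"),
    ]
    return "\n".join(key + ": " + value for key, value in settings) + "\n"


if __name__ == "__main__":
    print(build_stylesheet())
```

Saving that text to a file (say, a hypothetical `plotornot.mplstyle`) and passing the path to `matplotlib.style.use` should apply the settings, assuming a Matplotlib version that understands these keys.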

My hope is that Matplotlib will start to build some nice stylesheets that ship with the library. Eventually, I would also love to see a new option for my matplotlibrc file that specifies ‘default_style: latest’ — this would indicate that I am “opting-in” to whatever the Matplotlib developers deem to be the best default style. This style could then incrementally improve with each release, without breaking any legacy code.

In the meantime, the 6 months since SciPy have seen a lot of progress on viz libraries which build on top of (seaborn, prettyplotlib, a ggplot clone, mpld3, glue) or offer alternatives to (vincent, bokeh) Matplotlib. I’m excited about all of these projects, but hope also that Matplotlib is able to keep evolving to stay modern (I haven’t talked at all about Matplotlib’s API, but I would love to see that improve as well). Matplotlib has solved a lot of problems and remains the most mature library for plotting in Python by far. Even incremental improvements to Matplotlib can have a big effect on the Python community.


The Great Oscar Postmortem

Well, the Academy Awards are over (congrats to the winners; I know you’ve all been eagerly awaiting my approval) — how did my grand forecasting experiment turn out?

For some context, I’ll compare my forecast to some other statistically driven models: Nate Silver’s forecast, and the 3 other models covered along with mine in the Wall Street Journal. Briefly:

Nate Silver and Ben Zauzmer each developed a model largely similar to mine, making predictions by looking at precursor awards ceremonies and weighting each ceremony by its historical accuracy. Not surprisingly, our predictions are very similar.

David Rothschild from Microsoft Research used a fancier / more mysterious combination of market data and crowdsourcing to make predictions. Most of his predictions are quoted at high confidence. He was also able to make predictions for all 24 categories (including the 3-4 categories where it’s hard to find relevant data from precursor ceremonies).

Farsite also seemed to aggregate information from a broader pool than precursor ceremonies. Their predictions tended to fall between the “precursor models” and the Microsoft model.

I wrote about my predictions in the major categories on this blog, and published my final set of predictions on a google spreadsheet and NYTimes ballot before the ceremony. Unfortunately, some last-minute debugging changed one of my screenwriting predictions an hour before the ceremony (the revised prediction was correct, the original was wrong). Since I had previously published a different prediction, I will count that category as a mis-prediction here. Note also that I skipped the 3 short-subject categories and Production Design category, as I didn’t have enough relevant data to make a prediction. I’ve excluded those categories from the following analysis.

Here are the results

Top 6 Categories (Directing, Acting, Picture)

Beaumont: 5/6

Zauzmer: 5/6

Farsite: 5/6

Nate Silver: 4/6

Microsoft Research: 4/6

Only Farsite correctly predicted that Christoph Waltz would win for Best Supporting Actor. Ben and I lucked out when Ang Lee won Best Director (everyone else predicted a Spielberg win). We both claimed this category was essentially a toss-up, so we can’t claim to be too prescient here. However, our prediction seems more plausible than the Farsite/Microsoft predictions, which had Spielberg as a >3:1 favorite (again, though, this is only 1 data point).

Top 20 Categories (excluding Production Design and short subject categories)

Beaumont: 17/20

Zauzmer: 17/20

Microsoft: 17/20

Farsite/Silver: No Predictions

We each made 1-2 additional mistakes in the minor categories, albeit in different categories: I mis-called the Animated category and (thanks to the bug discussed above) mis-predicted Adapted Screenplay up until an hour before the awards. Ben missed Original Screenplay and Cinematography. Microsoft missed the Makeup Category.


The raw accuracy is the most interesting statistic, but not the entire story. A good model should also be calibrated — a prediction made with X% confidence should be correct about X% of the time. Predictions substantially more or less accurate than this are mis-calibrated.

At first glance, it seems that the Microsoft model is best calibrated. The average prediction confidence in this model is about 80%, comparable to the overall accuracy. By contrast, Ben’s and my models make predictions at ~55% confidence on average. In other words, our models were too conservative, based on how well they did.
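One quick way to make this comparison concrete is to set a model’s average stated confidence next to its hit rate. The sketch below does just that. The function name is hypothetical, and the toy forecast only mimics the rough figures quoted above (20 calls at 55% confidence, 17 of them right); it is not anyone’s actual ballot.

```python
def calibration_gap(confidences, correct):
    """Return (mean confidence, accuracy, accuracy - mean confidence)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need one outcome per prediction")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for hit in correct if hit) / len(correct)
    return mean_conf, accuracy, accuracy - mean_conf


if __name__ == "__main__":
    # HYPOTHETICAL forecast: 20 calls at 55% confidence, 17 of them right.
    conf = [0.55] * 20
    hits = [True] * 17 + [False] * 3
    mean_conf, accuracy, gap = calibration_gap(conf, hits)
    print(f"confidence {mean_conf:.2f}, accuracy {accuracy:.2f}, gap {gap:+.2f}")
```

A clearly positive gap hints at under-confidence and a clearly negative one at over-confidence, though a single year of awards is a small sample either way.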

In a few previous posts, I’ve discussed a method for visualizing model calibration. In essence, the idea is to use a model’s prediction confidence to simulate outcomes. For each simulation, we can plot the differences between the model prediction and simulated outcome. Finally, we overplot the actual outcome on top of these simulations. If the model is well-calibrated, the “reality line” should overlap the simulations. Let’s take a look at that:

Microsoft Model


Beaumont Model

Zauzmer Model

Each line in these plots shows how many mis-predictions were made at a given confidence level or greater (for example, the Microsoft model made 3 mistakes at confidence >0%, 2 mistakes at confidence >50%, 1 mistake at confidence >70%, and no mistakes at confidence >~75%). The red lines show 1000 simulations, the black line is the average simulation, the light/dark bands are the central 40%/80% of simulations, and the blue line is the actual performance.

These plots confirm and quantify what I said above: the models that Ben and I put together are too conservative, while the Microsoft model seems nicely calibrated. I have to admit I am both impressed by how well the “magic” Microsoft model calibrated itself, and curious about the under-confidence of my own model. I’ll be chewing on that in the coming days. Here are a few things I’ll be thinking about:

  • My predictions did better than I expected, based on testing on historical data. I was expecting to miss ~5 categories. After catching my screenwriting bug, I only missed 2. That’s mild evidence that this year’s Oscars played out more like the precursor awards than usual, which explains some of the under-confidence.
  • I used regularized regression to optimize my model; the “regularization” means the overall confidence of the model is adjusted up or down to match historical data. I’m starting to wonder if there’s any asymmetry in that process such that, given the relative simplicity/inflexibility of my model, the regularization prefers under-confidence. Who knows.

So… yay math?

So at the end of the day, is this worth it (I’m asking in the sense of ‘is there a significant edge to model-based predictions’, and not ‘is this a waste of time’)? Certainly, basing predictions off of precursor awards is a huge advantage over ignoring the data: my Oscar guesses from previous years were usually <~50% accurate, and not 85%.

What about the harder question of whether modeling is better than the naive strategy of looking at the precursor awards, and predicting whichever film has won the most awards? That simple strategy largely yields the same set of predictions. The few categories where modeling matters are close calls, where there is no obvious nominee with a plurality of precursor wins. The best example from this year was the Best Actress Category; both Jennifer Lawrence and Jessica Chastain won precursor awards. However, the mathematical models all realized that Jennifer Lawrence was more successful in the more influential awards (e.g. SAG), and correctly predicted her as a clear favorite.
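To see why weighting can break a tie that a plain win count cannot, here is a toy tally for the Best Actress race. The list of wins follows what these posts describe; the ceremony weights are invented for illustration (only the idea that SAG counts most comes from the posts), and this weighted vote is a stand-in for the regression model, not the model itself.

```python
from collections import defaultdict


def tally(wins, weights=None):
    """Sum precursor wins per nominee, optionally weighting each ceremony."""
    scores = defaultdict(float)
    for ceremony, nominee in wins:
        scores[nominee] += 1.0 if weights is None else weights.get(ceremony, 0.0)
    return dict(scores)


# Best Actress precursor wins as described in these posts.
WINS = [
    ("Golden Globe (Comedy)", "Jennifer Lawrence"),
    ("SAG", "Jennifer Lawrence"),
    ("Golden Globe (Drama)", "Jessica Chastain"),
    ("Critics Choice", "Jessica Chastain"),
]

# HYPOTHETICAL weights: the numbers are made up; only "SAG matters most"
# reflects the text.
WEIGHTS = {
    "SAG": 3.0,
    "Critics Choice": 2.0,
    "Golden Globe (Drama)": 1.0,
    "Golden Globe (Comedy)": 1.0,
}

if __name__ == "__main__":
    print(tally(WINS))           # plain win counts
    print(tally(WINS, WEIGHTS))  # wins weighted by ceremony
```

With plain counts the two nominees are level, so the naive strategy has nothing to say; any weighting that favors SAG tips the call toward Lawrence.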

Anyways, that’s a rather large brain dump. This was fun. And I FINALLY eked out an Oscar pool victory against my brother. Mission accomplished. Next year, I’m going after Microsoft.

Oscar Showdown

Here’s the full breakdown of Oscar forecasts from me, and the 3 others interviewed in the Wall Street Journal. Should make for an interesting post mortem!

Forecasting the Oscars Like a Boss: The Predictions


This post has been covered on the Wall Street Journal!

Last post, I described my mildly obsessive strategy for making predictions in my Oscar pool this year. I’ve been driven to such measures by the repeated and humiliating losses to my brother Jon for the past quarter-score.

To recap that post: the common wisdom among Oscar pundits is that the “precursor” awards which happen before the Academy Awards (e.g. the Golden Globes, Screen Actors / Directors / Producers Guild, BAFTA, and the Critics Choice awards) tend to correlate with who wins Oscars. From what I understand, Jon looks over these when choosing a winner. I tried the same thing last year, and was on track to win until Meryl Streep won Best Actress in an upset (jerk). That night I made a vow: never again.

So, I decided to take the same approach this year, but make it more systematic. I grabbed 20 years’ worth of award data from the Internet Movie Database, and built a model that takes into account the degree to which each precursor ceremony predicts the Oscars (different ceremonies do better in different categories). My theory is that this might provide an edge for close calls. Our Oscar pool also incorporates an interesting twist, in that we have some freedom to down-weight predictions that we aren’t sure of. My model makes probabilistic estimates, and thus also gives a strategy for how to weight each prediction.

Now that all of the precursor awards have taken place, I’m able to apply the model for the 2013 awards. Here are the main results (who really cares about Best Live Action short? Sorry, guy nominated for Best Live Action Short), with some punditry for good measure:

Best Picture

Argo (58%)

Les Miserables (12%)

Argo has swept the dramatic awards, and Les Mis won the Golden Globe for Best Picture (Comedy/Musical). The last time a movie swept the dramatic awards and lost Best Picture was when Brokeback Mountain lost to Crash.

Best Actor

Daniel Day-Lewis (65%)

Hugh Jackman / Denzel Washington (10%)

Another straightforward call, as Daniel Day-Lewis has swept the dramatic acting categories this year. Plus, people love that guy. A DDL loss would be a repeat of 2002, when Denzel Washington (Training Day) unexpectedly beat Russell Crowe (A Beautiful Mind), who had also swept the dramatic awards. The SAG and Critics Choice awards best predict this category.

Best Actress

Jennifer Lawrence (70%)

Jessica Chastain (20%)

Sorry, Quvenzhané Wallis. You may be who the Earth is for, but you aren’t who this award is for. Choosing between Lawrence and Chastain is tricky. Jennifer Lawrence won the Golden Globe comedy award and the SAG award. Jessica Chastain won the Golden Globe drama award and the Critics Choice award. The SAG is the best predictor, and thus the model prefers Jennifer Lawrence. If I hadn’t used a model, I would have guessed Jessica Chastain, thinking that dramatic movies have a better shot at winning Oscars than rom-coms. C’mon, math…

Supporting Actress

Anne Hathaway (87%)

Everyone else (2-5%)

The easiest prediction of the bunch. She’s been unbeatable in other ceremonies, so there’s no reason not to pick her based on the data.

Supporting Actor

Tommy Lee Jones (47%)

Christoph Waltz (30%)

This seems to be the most controversial category. The New York Times is predicting that Robert DeNiro will win, based on his aggressive Oscar campaigning and his icon status (two factors not present in my model, which thinks he has a 5% shot). Tommy Lee Jones won the SAG, which best predicts the acting categories. Christoph Waltz won both the Golden Globe and BAFTA. Historically, this is a difficult category to predict based on precursor ceremonies.


Best Director

Who knows? Ben Affleck has swept the other ceremonies, but was notoriously not nominated for an Oscar this year. Thus, there’s very little award information to go off of.

My model gives a slight preference to Ang Lee/Life of Pi (45%) over Steven Spielberg/Lincoln (40%), based on which ceremonies each was nominated for. My model isn’t really precise to within 5%, so it isn’t a statistically significant edge. Personally, I’m inclined to think that Steven Spielberg will win (since, you know, he’s Steven Spielberg, and it’s been a while since he won. Poor little guy.)

Animated Feature

Another tossup between Wreck-It Ralph (47%) and Brave (40%). Both BAFTA and the Critics Choice awards tend to predict the correct winner about 90% of the time but, this year, they awarded different films (BAFTA -> Brave, and Critics Choice -> Wreck-It Ralph). I love Pixar, but their recent movies aren’t as good as they were 5 years ago, and I loved Wreck-It Ralph. I’m rooting for that movie, and its sweet 80s Nintendo soundtrack.

Foreign Film

Amour (56%)

A Royal Affair (15%)

Historically, this is a hard award to predict from precursor ceremonies. This year Amour won BAFTA, the Critics Choice Award, and the Golden Globe, so I think its odds are pretty good. It is also nominated for Best Picture, for which it doesn’t stand a chance. I think voters will feel bad and give it the Foreign Oscar instead.

Original Screenplay

Django Unchained (58%; originally 73%)

Zero Dark 30 (36%; originally Flight at 15%)

Update (Feb 24, 7PM): I noticed an error in my Writers Guild Award data (1 hour before the ceremony!). I had incorrectly stored the Original Screenplay winner as Flight instead of Zero Dark 30, and the Adapted category as The Silver Linings Playbook instead of Argo. This changes the predictions in the two writing categories

Adapted Screenplay

Argo (65%; originally The Silver Linings Playbook at 65%)

Lincoln, Silver Linings Playbook (11% each; originally Argo and Lincoln at 12% each)

Update: See above.


That’s about it for the major-ish categories. Most of the minor categories don’t have many equivalent awards in other ceremonies, so the model predictions aren’t very compelling (sorry, lady nominated for Best Makeup and Hairstyling).

This was an interesting exercise — one of the key lessons (if you see this stuff as teaching moment material) is that there is a fairly high amount of unpredictability in predicting Oscar winners based on the other awards — typical forecast accuracies are around 60% (clearly better than random guessing from a field of 5-7 nominees, but nowhere near a lock).

Perhaps models with more information could do better (genre information seems particularly relevant). However, even the people who do this stuff for a living usually only get ~75% of the categories right. So maybe it’s just hard to predict.

Or maybe we just haven’t seen the Nate Silver of Oscar forecasting yet.


Apparently, Nate Silver is the Nate Silver of Oscar forecasting. His method and conclusions are largely the same as what’s posted here. That’s encouraging.

Forecasting the Oscars Like a Boss

For the past eight-ish years, I’ve participated in an Oscar prediction pool with my brothers and some of my friends. We’ve fooled around with a number of different scoring schemes. The version we’ve settled on the past few years has been to sort each prediction by how confident we are, and assign point values to each category accordingly. This seems to keep the contest more interesting longer through the awards broadcast.
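For the curious, here is a minimal sketch of how a confidence-ranked ballot like this could be scored. The exact point values (N points for the most confident pick down to 1 for the least) are an assumption, since all that’s stated above is that points follow the confidence ordering, and the function names are made up. The probabilities are the headline figures from the 2013 predictions post.

```python
def assign_points(confidences):
    """Give the most confident pick the most points (N down to 1)."""
    ranked = sorted(confidences, key=confidences.get)  # least confident first
    return {category: points for points, category in enumerate(ranked, start=1)}


def expected_score(confidences, points):
    """Expected points if each pick wins with its stated probability."""
    return sum(confidences[c] * points[c] for c in confidences)


# Headline probabilities from the 2013 predictions post.
PICKS = {
    "Best Picture": 0.58,
    "Best Actor": 0.65,
    "Best Actress": 0.70,
    "Supporting Actress": 0.87,
    "Supporting Actor": 0.47,
}

if __name__ == "__main__":
    points = assign_points(PICKS)
    print(points)
    print(expected_score(PICKS, points))
```

Under this assumed scheme, putting the biggest point values on the likeliest picks maximizes the expected score, which is why having calibrated probabilities (and not just a list of favorites) is useful in this kind of pool.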

These contests have typically resulted in my shame. It turns out that I am terrible at guessing which movies are likely to win Oscars (never mind the fact that I have NO FREAKING IDEA what the difference between sound editing and sound mixing is, but apparently they each need their own categories. Whatever.). My brother Jon has won handily the past 3 years, and that has to stop.

This year, I resolved to systematize my predictions, by building a model to forecast Oscar results. Why do this, you ask?

1) The lucrative $50 prize, and all of the opportunities that that opens up.

2) I’ve been looking for a non-astrophysics data analysis project, and this seemed fun. Also, astrophysicists throw garbage at you when you try to do inference without a physically-justified model, and I wanted to slum it up a bit with some purely-empirical forecasting.

3) I’ve been wanting to spend more time with the Pandas and Scikit-Learn Python libraries.

The Data

The Academy Awards is the last ceremony in a long awards season. The most obvious data to use to try to forecast the Oscars are the results from these previous ceremonies. This is especially true since the same people vote for multiple ceremonies, so these awards can be seen as polls for the Oscars. There are other potentially interesting variables to consider (Rotten Tomatoes rating, box office performance, genre or cast information, etc), but I decided to start with the ceremony results.

The IMDB archives the nominees and winners for the major ceremonies. They don’t provide a nice API for grabbing their data, but it’s easy enough to parse from the HTML (enter: BeautifulSoup). As is so often the case, data cleaning is among the most time-consuming tasks of the project. For example, many award categories change names slightly over time (Best Picture -> Best Motion Picture of the Year). After standardizing all of this information, I ended up with a JSON database of all the nominees and winners for 7 awards ceremonies since 1990 (about 8800 nominations in total).
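To give a flavor of the cleaning step, here is a small sketch of mapping scraped category labels onto one canonical name. The alias table holds only the single rename mentioned above, and the function name and table layout are hypothetical; a real table would need many more entries.

```python
import re

# Alias table seeded with the one rename mentioned in the post;
# keys are lower-cased, whitespace-normalized labels.
ALIASES = {
    "best motion picture of the year": "Best Picture",
    "best picture": "Best Picture",
}


def canonical_category(raw):
    """Map a scraped category label onto one canonical name.

    Unknown labels are returned unchanged apart from trimmed whitespace,
    so they can be reviewed by hand.
    """
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return ALIASES.get(key, raw.strip())


if __name__ == "__main__":
    print(canonical_category("Best Motion  Picture of the Year"))
```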


Before doing anything fancy, I wanted to get a grasp on what the data looked like. To be concrete, I’ll focus on the Best Picture category for the moment. Here’s a plot of what fraction of movies go on to win the Best Picture Oscar, as a function of whether they were nominated for or won the award in a different ceremony.

Correlation between best picture nominees/winners for the Oscars and other ceremonies


Interestingly (though not all that surprising in retrospect), winning the Independent Spirit Award (which focuses on indie cinema) is anti-correlated with winning the Oscar. The only movie since 1990 to win both the Independent Spirit and Oscar Best Picture awards was The Artist last year. Also interesting is the fact that the Golden Globes correlates so weakly with the Oscars — this ceremony is often touted as being a good predictor for the Academy Awards, but other ceremonies clearly do better.


How do I combine all of this information into an estimate of who is most likely to win the 2013 Best Picture award? Equally important, how do I assess how confident this prediction is, so that I can wager more points on the categories that are most certain?

One of the standard strategies for estimating success probabilities based on a 0/1 outcome (a movie either wins, or it doesn’t) is logistic regression. It’s pretty much the simplest thing to try, so it’s worth checking how well it does.

There are two wrinkles to this model:

  1. It doesn’t account for the fact that exactly one nominee in each category wins. I address this by re-normalizing the probabilities in the model within each category, and adjusting the likelihood calculation of the data given the model accordingly. This makes finding the optimal model slightly harder, but the dataset is small enough that computing power isn’t much of a concern.
  2. With only 23 years of historical data, over-fitting is a danger. To address this, I ran a regularized regression. That is, instead of fitting the model by maximizing the likelihood, I maximize a modified likelihood that penalizes Logistic Regression models with large coefficients. The size of coefficients in a Logistic Regression model directly relates to how confident predictions are, so the penalty acts to make the model more conservative, and less likely to draw too-strong a conclusion from a small training dataset. The strength of the penalty is chosen by cross-validation.
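The two wrinkles above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the code behind these forecasts: the function names are made up, the features are toy 0/1 flags for “won precursor A” and “won precursor B”, and the squared (L2) penalty is an assumption, since all that’s said above is that large coefficients are penalized.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(features, weights, bias=0.0):
    """Logistic score per nominee, rescaled so the category sums to 1."""
    raw = [sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
           for row in features]
    total = sum(raw)
    return [r / total for r in raw]


def penalized_nll(categories, weights, penalty, bias=0.0):
    """Negative log-likelihood of the observed winners plus an L2 penalty.

    `categories` is a list of (features, winner_index) pairs, one per year.
    """
    nll = 0.0
    for features, winner_index in categories:
        probs = category_probabilities(features, weights, bias)
        nll -= math.log(probs[winner_index])
    return nll + penalty * sum(w * w for w in weights)


if __name__ == "__main__":
    # Toy year: nominee 0 won precursor A, nominee 1 won precursor B,
    # nominee 2 won neither; nominee 0 took the Oscar.
    year = ([[1, 0], [0, 1], [0, 0]], 0)
    for penalty in (0.0, 1.0):
        print(penalty, penalized_nll([year], [2.0, 1.0], penalty))
```

A real fit would minimize this objective over the weights and pick the penalty strength by cross-validation, as described above; the sketch only evaluates the objective for given weights.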

The scikit-learn library is really wonderful for this kind of work. First, they provide a lot of functionality out-of-the-box (optimization, cross validation, and implementations of dozens of models). Furthermore, the API is extremely consistent, so that you can build your own custom classifiers, using scikit-learn objects as building blocks. I definitely plan on using it more, even for more vanilla model fitting and optimization tasks (its API is way better than most of SciPy in my opinion).

There are a number of criteria to evaluate whether this model is a good fit to the data.

How accurate is it?

This model correctly classifies about 75% of the best picture winners since 1990. Furthermore, the years it fails usually correspond to notable upsets; for example, Crash unexpectedly won Best Picture in 2006, despite Brokeback Mountain being a strong favorite. Brokeback Mountain won best picture in every other ceremony in the database, and Crash is the only movie that won the Oscar without even being nominated for the Golden Globe. Other notable upsets include 1999 (when Shakespeare in Love beat Saving Private Ryan) and 1996 (Braveheart won, the model predicted Apollo 13). I see these upsets as indicative of the inherent uncertainty in trying to predict Oscar winners based on other ceremonies.

How representative is the data?

One nice property about the model is that it is generative — you can use it to simulate hypothetical outcomes for each year, based on the information provided from the other ceremonies. If the model is a fair representation of the data, then the actual outcome should look like these simulated outcomes (if it doesn’t, this suggests an over-fit or mis-specified model).

Monte-Carlo Simulation from the Best Picture Forecast Model


The left plot shows, for each of 1000 hypothetical results of 23 years of best picture awards, how many years were correctly predicted by the model. The black line is the actual data, and falls near the typical value of this distribution. That’s encouraging.

The right plot is slightly more discerning. It plots, as a function of the confidence at which each prediction is made, how many mistakes were made which had confidences less than this threshold. If these confidences are right, then the model should make most of its mistakes at lower confidences; that is, the curve should rise steeply on the left and then level off (I used a similar plot when talking about Nate Silver’s election forecast results). The black lines show 500 simulations, and the red line shows the actual data. Again, it’s encouraging that it runs through the collection of black curves.

This fairly simple model seems to do a pretty good job at characterizing the outcome of the Best Picture category for the past 2 decades. It’s possible that a better model (or additional information about each nominee) could make the predictions more precise (i.e. skew the blue histogram to the right, and make it narrower). I may look into this in the coming weeks. In any event, I’ll be able to make predictions about the 2013 Oscars once the other award ceremonies happen. Look for more blog posts.

I’m coming for you, Jon.

Nate Silver Was Right

Nate Silver made a lot of testable predictions about the election on his 538 blog. In particular, he predicted the winner of each state (and DC), and placed a confidence percentage on each prediction. He did the same for senate races. In total, that’s 84 predictions with confidence estimates.

As in 2008, his predictions were phenomenal. Some of the races are not yet decided, but it looks like all of his presidential predictions were correct (Florida is not yet called as I write this), as well as all but perhaps 2 of his senate predictions (Democrat candidates are unexpectedly leading the Montana and North Dakota races). There are plenty of pundits who were predicting very different results.

Granted, while 82-84 correct predictions sounds (and is) amazing, many of those were no brainers. Romney was always going to win Texas, just as Obama was a sure bet in New York. A slightly harder test is whether his uncertainty model is consistent with the election outcome.

Let’s simulate 1000 elections. For each race, we assume (for the moment) Silver’s uncertainties are correct. That is, if he called a race at a confidence of x%, then we assume the prediction should be wrong (100-x)% of the time.

Number of errors in 1000 simulated elections (red, shown with jitter) as a function of prediction confidence level

This plot shows, for each simulated election (in red), the total number of mis-predicted races with prediction confidences greater than the threshold on the x axis. The left edge of the plot gives the total number of mis-predictions (at any confidence). The half-way point shows the number of errors for predictions with confidences greater than 75%.

The lines all go down with increasing confidence — that is, there are fewer expected errors for high-confidence predictions (like Texas or New York). I’ve added some random jitter to each line, so they don’t overlap so heavily. The grey bands trace the central 40% and 80% of the simulations. The thick black line is the average outcome.

This plot summarizes the number of mis-classifications you would expect from Nate Silver’s 538 blog, given his uncertainty estimates. A result that falls substantially above the gray bands would indicate too many mistakes, and too-optimistic a confidence model. Lines below the bands indicate not enough mistakes, or too pessimistic a model.
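Here is a rough sketch of the simulation behind a plot like this, using only the standard library. The confidence values are invented (84 made-up numbers, not the real 538 forecasts), the function names are hypothetical, and the band calculation uses a crude nearest-rank percentile, so treat it as an outline of the idea rather than the exact code used for the figures.

```python
import random


def errors_above(confidences, wrong, thresholds):
    """Count mis-predictions among calls made at confidence >= each threshold."""
    return [sum(1 for c, w in zip(confidences, wrong) if w and c >= t)
            for t in thresholds]


def simulate_curves(confidences, thresholds, n_sims=1000, seed=0):
    """Simulate outcomes where a call at confidence c is wrong with probability 1 - c."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_sims):
        wrong = [rng.random() >= c for c in confidences]
        curves.append(errors_above(confidences, wrong, thresholds))
    return curves


def central_band(curves, fraction):
    """Per-threshold (low, high) bounds holding roughly the central `fraction` of runs."""
    tail = (1.0 - fraction) / 2.0
    bounds = []
    for column in zip(*curves):
        ordered = sorted(column)
        last = len(ordered) - 1
        bounds.append((ordered[int(tail * last)],
                       ordered[int(round((1.0 - tail) * last))]))
    return bounds


if __name__ == "__main__":
    # HYPOTHETICAL confidences for 84 races, not the real forecast.
    confidences = [0.99] * 60 + [0.9] * 14 + [0.75] * 6 + [0.6] * 4
    thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
    curves = simulate_curves(confidences, thresholds)
    print(central_band(curves, 0.8))
```

Comparing the actual error counts at each threshold against these bands is then a matter of calling `errors_above` once with the real outcomes.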

If we assume that the North Dakota and Montana senate races end up as upsets, here is Nate Silver’s performance:

Nate Silver’s actual performance, assuming ND and Montana senate races are Democrat upsets

He did, in fact, do slightly better than expected (I doubt he’ll lose sleep over that). This result is broadly consistent with what we should expect if Silver’s model is correct. On the other hand, consider what happens if he ends up correctly predicting these two senate races. It’s unlikely that Nate Silver should have predicted every race correctly, given his uncertainty estimates (this happens in about 2% of simulated elections). It’s possible that Silver will actually tighten up his uncertainty estimates next election.

In any event, I think he knows what he’s talking about. I’m reminded of this clip (Nate Silver is Jeff Goldblum. The rest of the world is Will Smith)

Jobs Added Under Different Presidents

In his speech at the Democratic National Convention, Bill Clinton made the following claim:

Well, since 1961, for 52 years now, the Republicans have held the White House 28 years, the Democrats, 24. In those 52 years, our private economy has produced 66 million private sector jobs.

So what’s the job score? Republicans, 24 million; Democrats, 42 (million)

The fact-checking site Politifact was quick to verify his assertion, and also provided a few caveats about the figure — namely, that Presidents probably don’t deserve as much credit or blame for this number as they are given. Nevertheless, I wanted to see the breakdown for myself.

The Bureau of Labor Statistics is great for data like this, and I appreciate that our government collects and distributes such data. I took a look at the non-government employment rates that Clinton’s claim is based on (this is the relevant table). First, the raw employment figure from 1961 until today:

US Employment (Non-Government, Seasonally-Adjusted)

Next, color-coded by the sitting president’s party:

US Employment by president

Next, the difference in employment from the day each president took office:

Jobs added or lost under Presidents since 1961

And shown on top of each other, with the net change:

Jobs added or lost by presidents since 1961
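The bookkeeping behind these last two plots is simple: subtract the employment level at the start of each term from the level at the end, then add up the changes by party. Here is a small sketch of that arithmetic. The employment numbers and president names are invented placeholders, not BLS data, and the function name is my own.

```python
def jobs_added_by_party(employment, terms):
    """Net job change per president and per party.

    employment: dict mapping (year, month) -> jobs, in thousands
    terms: list of (name, party, start, end), with start/end as (year, month)
    """
    per_president = {}
    per_party = {}
    for name, party, start, end in terms:
        change = employment[end] - employment[start]
        per_president[name] = change
        per_party[party] = per_party.get(party, 0) + change
    return per_president, per_party


# Made-up numbers purely for illustration.
toy_employment = {(2001, 1): 100, (2005, 1): 110, (2009, 1): 105, (2013, 1): 120}
toy_terms = [
    ("President A", "R", (2001, 1), (2005, 1)),
    ("President B", "D", (2005, 1), (2009, 1)),
    ("President C", "D", (2009, 1), (2013, 1)),
]
per_president, per_party = jobs_added_by_party(toy_employment, toy_terms)
print(per_president)
print(per_party)
```

With the real table you would key the dictionary on the monthly BLS values and use each inauguration month as the term boundary.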

Given the current rhetoric, I was a little surprised at how similar President Obama’s line (lowest blue one) is to President Reagan’s (the highest red one). The turnaround under Obama’s presidency has been slower, but now seems to be improving at a rate comparable to that under Reagan and Clinton (highest line).

The Aspects of Astronomy in the Cloud That Scare Me

I spent the last two days in a very interesting discussion group about visualization challenges for ALMA. ALMA is arguably the first observatory where the data products will routinely lie in “big data” territory: that is, the gigabyte-to-terabyte range where data sets can’t easily be analyzed on a single machine. We’ve created observational datasets this large before, but they have arguably been niche products that only a few researchers use in their entirety (large swaths of the entire 2MASS or Sloan surveys, for example). Many, many people who use ALMA data will have to contend with data sizes >> RAM. The community needs to come up with solutions for people to work with these data products.

The big theme at this discussion group was moving visualization and analysis to the cloud, where more numerous and powerful computers crunch through mammoth files, and astronomers interact with this resource through some kind of web service. We spent a lot of time looking at a nice data viewer and infrastructure developed in Canada that is great for browsing through 100GB (and larger) image cubes. Yet I find myself uneasy about this move to the cloud. I seemed to be in the minority within the group, as most others embraced or accepted this methodology as the inevitable future of data interaction in astronomy (I may or may not have been called a dinosaur; admittedly, I was being a bit obnoxious about my point!).

I get that cloud computing is unavoidable at some level — most astronomers do not have nearly enough computational resources or knowledge to tackle Terabyte image cubes, and we will need to rely on a centralized infrastructure for our big data needs. Centralized resources are also great for community science, where lots of people need to work on the same data. But in an attempt to defend (or at least define) my dinosaur attitudes, here are the issues that I think astronomy cloud computing needs to address:

Scope of access: How often and to what extent will an observer have access to cloud resources? Will she be able to visualize data whenever she wants? Will she be able to run arbitrary computation? How much of a lag will there be between requests and results? Many of us are used to a tight feedback cycle when visualizing, analyzing and interpreting data. Is it a priority to preserve this workflow? Is that technologically and financially feasible?

Style of access: How many ways will we be able to interact with data? What restrictions will be placed on the computation and visualizations we undertake? Will we be able to download smaller sections of the data product for exploration offline? Will this API be in a convenient form (python library, RESTful URL, SQL) or some more awkward solution (custom VO protocol, cluttered web form)? What will the balance be between GUI and programmatic access? How well will each be designed and supported (personally, I can tolerate a poor GUI interface much more than a bad programming library)?

Bottlenecks for single machines: Underlying all of this is the assumption that it is impossible to work with ALMA data on local machines. I think this is overhyped in some respects. Storing even a terabyte of data is trivial (1 TB hard drives are $100, compared to $2000 per year to store 1 TB on Amazon’s cloud, to say nothing of computation). While churning through all of this data is certainly a many-hour task with a single disk, many operations relevant for visualization, exploration, and simple analysis are trivial (extracting profiles, slices, and postage stamps on a properly indexed data cube is very cheap, and gives you a lot of power to understand data and develop analysis plans). Should we really fully abandon this workflow that almost all astronomers currently use? Is it worth developing new software to help interact with local data more easily?
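To make the "cheap slices" point concrete, here is a toy sketch of pulling a row or a spectrum out of a cube on disk without loading the whole file. It assumes a layout of my own choosing: raw little-endian float32 values in C order, with the channel axis slowest. Real ALMA products typically ship as FITS files, so this only illustrates the seek arithmetic, and the function names are hypothetical.

```python
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per little-endian float32 value


def write_cube(path, nz, ny, nx):
    """Write a toy cube in C order (z slowest, x fastest); values encode the index."""
    with open(path, "wb") as f:
        for z in range(nz):
            for y in range(ny):
                row = [float(z * 10000 + y * 100 + x) for x in range(nx)]
                f.write(struct.pack(f"<{nx}f", *row))


def read_row(path, shape, z, y):
    """Read one image row of one channel by seeking straight to it."""
    nz, ny, nx = shape
    with open(path, "rb") as f:
        f.seek((z * ny + y) * nx * FLOAT_SIZE)
        return list(struct.unpack(f"<{nx}f", f.read(nx * FLOAT_SIZE)))


def read_spectrum(path, shape, y, x):
    """Read pixel (y, x) in every channel: one small seek and read per channel."""
    nz, ny, nx = shape
    values = []
    with open(path, "rb") as f:
        for z in range(nz):
            f.seek(((z * ny + y) * nx + x) * FLOAT_SIZE)
            values.append(struct.unpack("<f", f.read(FLOAT_SIZE))[0])
    return values


shape = (4, 5, 6)  # (channels, rows, columns) of the toy cube
path = os.path.join(tempfile.mkdtemp(), "toy_cube.bin")
write_cube(path, *shape)
print(read_row(path, shape, z=1, y=0))
print(read_spectrum(path, shape, y=2, x=3))
```

The amount of data read scales with the size of the slice, not the size of the cube, which is why these operations stay cheap even when the file is far larger than RAM.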

By no means are these issues insurmountable, and I was probably sweating the details too much for the high-level discussion at the meeting. But the details do matter, and the astronomical community has had a mixed track record with creating interfaces to remote data products (new visualization clients are getting pretty good, but services for analysis or data retrieval are still pretty cumbersome). My reaction to most of these clumsy products has been to avoid them, because it has been possible to fetch and analyze the data myself. Once we lose that ability, we will all become very dependent on external services. At that point, the details of remote data interfaces may become the new bottleneck for discovery.

RAWRRRR (dinosaur noises)

Critiquing the Divorce Post

Update: Paul Van Slembrouck, the designer of this graphic, has responded to the critique. Be sure to read his comments below!

I am a teaching fellow for a class at Harvard called “The Art of Numbers,” which teaches principles of data presentation to undergraduates from all concentrations. For a recent midterm, students were asked to analyze this graphic from

Distribution of education levels for women who divorced in 2008

For Valentine’s Day posted a series of visualizations of divorce statistics in the U.S. Several aspects of this graph bothered me, and I thought it would make for a good exam question.


Interpreting the Hockey Stick Diagram

The Hockey Stick Plot is one of the most iconic and controversial plots related to climate change. It shows the change in average temperature over the past few thousand years. Here’s a version from Wikipedia

Temperature Change Over Time

One little aspect of this graph irks me. Let me be clear up front: I am not questioning the science behind this plot, or the conclusions drawn from it. I tend to trust consensuses (consensi?) in the scientific community. I also tend to think that people who accuse scientists of swindling the public for their own personal gain don’t understand the attitudes within the scientific community.

My little problem

The most striking feature of this plot is the rise in temperature over the last 150 years or so — the scale of this change is larger than other natural variations on ~100 year timescales, and strongly suggests an external influence (i.e. humans).

My problem is that the hockey stick diagram is often used to implicate the Industrial Revolution of the 1800s. After all, the knee in this diagram occurs right around 1800. David MacKay, in his great book on energy consumption, even goes so far as to label the year that James Watt invented his steam engine (note he’s using a different proxy for climate change: CO2 concentration instead of temperature change):

From David MacKay's "Sustainable Energy -- Without the Hot Air".

I don’t doubt that the Industrial Revolution marks a significant milestone in human climate change. However, I am less convinced that these diagrams really show that.

Here’s my reasoning. Human population growth has been roughly exponential over time:

Human Population Growth

Human Population Growth Over Time (Data from Wikipedia)

This is slightly steeper than exponential, but there’s no sharp knee at 1800. Likewise, most of the things that humans produce (and pollute with) have also grown exponentially over time — electricity, computers, tires, etc. Human growth has traditionally been exponential.

Given this simple observation, my naive intuition would be that the historical temperature record would be broadly described by the sum of two trends: a flat line representative of the earth’s equilibrium temperature, and an exponential curve that encapsulates the growing impact of humans.

This simple model can reproduce the general shape of the hockey stick pretty well:

Black: Historical Climate Data (Jones and Mann 2004). Red: Fit using a constant + exponential

This model doesn’t account for the bumps and wiggles (due to natural climate oscillations about the equilibrium). But the key point here is this: neither term in this model has a characteristic time scale. The “knee” in the graph represents the time when the exponential human-factor starts to overwhelm the constant term, but there’s nothing special about how the ‘human factor’ is changing around the time of the Industrial Revolution.
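Here is a small numerical sketch of that argument. The parameter values are hypothetical, chosen only so the knee lands near 1800; they are not the fitted values from the plot above, and the function names are my own. The exponential term grows by the same fraction every year, whether you look in 1200, 1800 or 2000, and the apparent knee is just the year in which that term reaches a size you can see against the baseline.

```python
import math


def model(year, baseline, amplitude, rate):
    """Constant equilibrium term plus an exponential 'human' term."""
    return baseline + amplitude * math.exp(rate * year)


def human_term_growth(year, amplitude, rate, step=1.0):
    """Fractional growth of the exponential term over `step` years."""
    now = amplitude * math.exp(rate * year)
    later = amplitude * math.exp(rate * (year + step))
    return later / now - 1.0


def knee_year(visible_level, amplitude, rate):
    """Year at which the exponential term reaches `visible_level`."""
    return math.log(visible_level / amplitude) / rate


# Hypothetical parameters: 2% growth per year, 0.1 degrees reached in 1800.
rate = 0.02
amplitude = 0.1 * math.exp(-rate * 1800)

for year in (1200, 1800, 2000):
    print(year, round(human_term_growth(year, amplitude, rate), 6))

print("knee:", round(knee_year(0.1, amplitude, rate)))
print("knee with 10x smaller amplitude:", round(knee_year(0.1, amplitude / 10, rate)))
```

Shrinking the amplitude moves the knee later without changing the growth rate at all, so the position of the knee says when the human term became noticeable, not when its behaviour changed.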

Perhaps it’s a nitpicky point to make, but the hockey-stick diagram on its own doesn’t isolate the Industrial Revolution as the cause of human-induced climate change (other analyses might, of course). Instead, it points to 1800 as the time when human growth (industrial, agricultural, whatever) became significant on a global scale. A more convincing indictment of the Industrial Revolution (and not population growth in general) would isolate the human contribution — and show a knee around 1800. This would more directly show that ‘something changed’ in a distinct way when the Industrial Revolution began.

Please don’t quote me as saying the hockey stick diagram is wrong. I’m talking to you, Rick Perry.