The Great Oscar Postmortem

Well, the Academy Awards are over (congrats to the winners; I know you’ve all been eagerly awaiting my approval) — how did my grand forecasting experiment turn out?

For some context, I’ll compare my forecast to some other statistically driven models: Nate Silver’s forecast, and the 3 other models covered along with mine in the Wall Street Journal. Briefly:

Nate Silver and Ben Zauzmer each developed a model largely similar to mine, making predictions by looking at precursor award ceremonies and weighting each ceremony by its historical accuracy. Not surprisingly, our predictions are very similar.

David Rothschild from Microsoft Research used a fancier / more mysterious combination of market data and crowdsourcing to make predictions. Most of his predictions are quoted at high confidence. He was also able to make predictions for all 24 categories (including the 3-4 categories where it’s hard to find relevant data from precursor ceremonies).

Farsite also seemed to aggregate information from a broader pool than precursor ceremonies. Their predictions tended to fall between the “precursor models” and the Microsoft model.

I wrote about my predictions in the major categories on this blog, and published my final set of predictions on a Google spreadsheet and NYTimes ballot before the ceremony. Unfortunately, some last-minute debugging changed one of my screenwriting predictions an hour before the ceremony (the revised prediction was correct, the original was wrong). Since I had previously published a different prediction, I will count that category as a mis-prediction here. Note also that I skipped the 3 short-subject categories and the Production Design category, as I didn’t have enough relevant data to make a prediction. I’ve excluded those categories from the following analysis.

Here are the results:

Top 6 Categories (Directing, Acting, Picture)

Beaumont: 5/6

Zauzmer: 5/6

Farsite: 5/6

Nate Silver: 4/6

Microsoft Research: 4/6

Only Farsite correctly predicted that Christoph Waltz would win for Best Supporting Actor. Ben and I lucked out when Ang Lee won Best Director (everyone else predicted a Spielberg win). We both claimed this category was essentially a toss-up, so we can’t claim to be too prescient here. However, our prediction seems more plausible than the Farsite/Microsoft predictions, which had Spielberg as a >3:1 favorite (again, though, this is only 1 data point).

Top 20 Categories (excluding Production Design and short subject categories)

Beaumont: 17/20

Zauzmer: 17/20

Microsoft: 17/20

Farsite/Silver: No Predictions

We each made 1-2 additional mistakes in the minor categories, albeit in different ones: I mis-called the Animated category and (thanks to the bug discussed above) mis-predicted Adapted Screenplay up until an hour before the awards. Ben missed Original Screenplay and Cinematography. Microsoft missed the Makeup category.

Calibration

The raw accuracy is the most interesting statistic, but not the entire story. A good model should also be calibrated — a prediction made with X% confidence should be correct about X% of the time. Predictions substantially more or less accurate than this are mis-calibrated.
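
To make that concrete, here is a toy calibration check (the confidences and outcomes below are invented for illustration, not any of the real models' numbers): bucket predictions by their stated confidence and compare each bucket's stated confidence to its realized hit rate.

```python
import numpy as np

# Invented example: stated confidence for each prediction, and whether it was right.
confidence = np.array([0.55, 0.60, 0.62, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 0.97])
correct    = np.array([0,    1,    1,    1,    0,    1,    1,    1,    1,    1])

# Bucket predictions by confidence and compare stated vs. realized accuracy.
edges = [0.5, 0.7, 0.9, 1.01]
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bucket = (confidence >= lo) & (confidence < hi)
    if in_bucket.any():
        print(f"stated ~{confidence[in_bucket].mean():.0%} confidence -> "
              f"{correct[in_bucket].mean():.0%} actually correct "
              f"({in_bucket.sum()} predictions)")
```

For a well-calibrated model the two percentages track each other across buckets; for an under-confident model the realized accuracy sits above the stated confidence.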

At first glance, it seems that the Microsoft model is best calibrated. The average prediction confidence in this model is about 80%, comparable to its overall accuracy. By contrast, Ben’s model and mine make predictions at ~55% confidence on average. In other words, our models were too conservative, given how well they did.

In a few previous posts, I’ve discussed a method for visualizing model calibration. In essence, the idea is to use a model’s prediction confidences to simulate outcomes. For each simulation, we can tally how often the simulated outcomes disagree with the model’s predictions. Finally, we overplot the actual outcome on top of these simulations. If the model is well-calibrated, the “reality line” should overlap the simulations. Let’s take a look at that:

Microsoft Model

Beaumont Model

Zauzmer Model

Each line in these plots shows how many mis-predictions were made at a given confidence level or greater (for example, the Microsoft model made 3 mistakes at confidence > 0%, 2 mistakes at confidence > 50%, 1 mistake at confidence > 70%, and no mistakes at confidence > ~75%). The red lines show 1000 simulations, the black line is the average simulation, the light/dark bands are the central 40%/80% of simulations, and the blue line is the actual performance.
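
For the curious, here is roughly how a plot like this can be produced in Python (numpy + matplotlib). This is a stripped-down sketch with made-up confidences and outcomes standing in for a real model's predictions; the bands and line styles mirror the description above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Made-up per-category prediction confidences and actual hit(1)/miss(0) outcomes.
confidence = np.array([0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.52,
                       0.92, 0.88, 0.78, 0.72, 0.68, 0.63, 0.58, 0.56, 0.82, 0.66])
correct    = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
                       1, 1, 1, 1, 1, 0, 1, 1, 1, 1])

thresholds = np.linspace(0.0, 1.0, 101)

def misses_above(hits, conf, thresh):
    """Count mis-predictions among predictions with confidence > threshold."""
    return np.array([np.sum((conf > t) & (hits == 0)) for t in thresh])

# Simulate outcomes: each call is right with probability equal to its confidence.
sims = np.array([misses_above(rng.random(confidence.size) < confidence,
                              confidence, thresholds)
                 for _ in range(1000)])

lo80, hi80 = np.percentile(sims, [10, 90], axis=0)  # central 80% of simulations
plt.fill_between(thresholds, lo80, hi80, color="gray", alpha=0.3)
plt.plot(thresholds, sims.T, color="red", alpha=0.01)            # individual simulations
plt.plot(thresholds, sims.mean(axis=0), color="black", lw=2)     # average simulation
plt.plot(thresholds, misses_above(correct, confidence, thresholds),
         color="blue", lw=2)                                     # actual performance
plt.xlabel("confidence threshold")
plt.ylabel("mis-predictions at confidence > threshold")
plt.show()
```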

These plots confirm and quantify what I said above — the models that Ben and I put together are too conservative, while the Microsoft model seems nicely calibrated. I have to admit I am both impressed by how well the “magic” Microsoft model calibrated itself, and curious about the under-confidence of my own model. I’ll be chewing on that in the coming days. Here are a few things I’ll be thinking about:

  • My predictions did better than I expected, based on testing on historical data. I was expecting to miss ~5 categories. After catching my screenwriting bug, I only missed 2. That’s mild evidence that this year’s Oscars played out more like the precursor awards, which explains some of the under-confidence.
  • I used regularized regression to optimize my model; the “regularization” means the overall confidence of the model is adjusted up or down to match historical data. I’m starting to wonder if there’s any asymmetry in that process such that, given the relative simplicity/inflexibility of my model, the regularization prefers under-confidence (a toy sketch of that idea follows this list). Who knows.
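
Here is the kind of toy experiment I have in mind. This uses off-the-shelf L2-regularized logistic regression from scikit-learn on fabricated data, not my actual model or the real precursor features: as the regularization gets stronger, the predicted probabilities get squeezed toward 50%, i.e. toward under-confidence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fabricated "precursor score" feature that strongly separates winners from losers.
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

x_new = np.array([[1.5]])  # hypothetical nominee with a strong precursor record

for C in (100.0, 1.0, 0.001):  # smaller C means stronger regularization in sklearn
    p_win = LogisticRegression(C=C).fit(X, y).predict_proba(x_new)[0, 1]
    print(f"C = {C:>7}: predicted win probability = {p_win:.2f}")

# Typical output: the probability slides from near-certainty toward ~0.5
# as C shrinks, even though the underlying signal is unchanged.
```

Whether my particular regularization scheme behaves the same way is exactly the question I want to chew on.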

So… yay math?

So at the end of the day, is this worth it (I’m asking in the sense of ‘is there a significant edge to model-based predictions’, and not ‘is this a waste of time’)? Certainly, basing predictions off of precursor awards is a huge advantage over ignoring the data — my Oscar guesses from previous years were usually <~50% accurate, not 85%.

What about the harder question of whether modeling is better than the naive strategy of looking at the precursor awards and predicting whichever nominee has won the most of them? That simple strategy yields largely the same set of predictions. The few categories where modeling matters are close calls, where no nominee has an obvious plurality of precursor wins. The best example from this year was the Best Actress category: both Jennifer Lawrence and Jessica Chastain won precursor awards. However, the mathematical models all recognized that Jennifer Lawrence was more successful at the more influential ceremonies (e.g. SAG), and correctly predicted her as a clear favorite.
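
Here is a schematic of the difference between the two strategies. The ceremonies, weights, and winners below are invented for illustration; the real model learns its weights from historical data.

```python
from collections import Counter

# Invented precursor results for one hypothetical category.
precursor_winners = {
    "SAG": "Nominee A",
    "Golden Globes": "Nominee B",
    "Critics Choice": "Nominee B",
    "BAFTA": "Nominee A",
}

# Naive strategy: pick whoever won the most precursor awards (tied 2-2 here,
# so the choice is essentially a coin flip).
naive_counts = Counter(precursor_winners.values())
print("naive counts:", dict(naive_counts))

# Model-style strategy: weight each ceremony by its (invented) historical
# accuracy at predicting the Oscar, then pick the highest weighted total.
weights = {"SAG": 0.8, "Golden Globes": 0.5, "Critics Choice": 0.4, "BAFTA": 0.7}
weighted = Counter()
for ceremony, winner in precursor_winners.items():
    weighted[winner] += weights[ceremony]
print("weighted scores:", dict(weighted))
print("weighted pick:", weighted.most_common(1)[0][0])  # Nominee A
```

In a category with a clear front-runner the two strategies agree; the weighting only changes the answer in exactly these split-decision cases.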

Anyways, that’s a rather large brain dump. This was fun. And I FINALLY eked out an Oscar pool victory against my brother. Mission accomplished. Next year, I’m going after Microsoft.


Nate Silver Was Right

Nate Silver made a lot of testable predictions about the election on his 538 blog. In particular, he predicted the winner of each state (and DC), and placed a confidence percentage on each prediction. He did the same for senate races. In total, that’s 84 predictions with confidence estimates.

As in 2008, his predictions were phenomenal. Some of the races are not yet decided, but it looks like all of his presidential predictions were correct (Florida is not yet called as I write this), as well as all but perhaps 2 of his senate predictions (Democrat candidates are unexpectedly leading the Montana and North Dakota races). There are plenty of pundits who were predicting very different results.

Granted, while 82-84 correct predictions sounds (and is) amazing, many of those were no-brainers. Romney was always going to win Texas, just as Obama was a sure bet in New York. A slightly harder test is whether his uncertainty model is consistent with the election outcome.

Let’s simulate 1000 elections. For each race, we assume (for the moment) that Silver’s uncertainties are correct. That is, if he called a race at a confidence of x%, then we assume the prediction should be wrong (100-x)% of the time.
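
A minimal version of that simulation, with a made-up list of confidences standing in for Silver's 84 published numbers, looks something like this:

```python
import numpy as np

rng = np.random.default_rng(2012)

# Stand-in for the 84 published win probabilities; the real analysis uses
# the actual numbers from the 538 forecast.
confidences = rng.uniform(0.5, 1.0, size=84)

thresholds = np.linspace(0.5, 1.0, 51)
n_sim = 1000

# Each simulated election: a call is wrong with probability (1 - confidence);
# count the misses among calls above each confidence threshold.
misses = np.empty((n_sim, thresholds.size), dtype=int)
for i in range(n_sim):
    wrong = rng.random(confidences.size) >= confidences
    misses[i] = [np.sum(wrong & (confidences > t)) for t in thresholds]

print("expected total misses:", misses[:, 0].mean())
print("central 80% of totals:", np.percentile(misses[:, 0], [10, 90]))
```

Plotting one misses-vs-threshold line per simulated election, along with the actual tally, gives the figure below.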

Number of errors in 1000 simulated elections (red, shown with jitter) as a function of prediction confidence level

This plot shows, for each simulated election (in red), the total number of mis-predicted races with prediction confidences greater than the threshold on the x axis. The left edge of the plot gives the total number of mis-predictions (at any confidence). The half-way point shows the number of errors for predictions with confidences greater than 75%.

The lines all go down with increasing confidence — that is, there are fewer expected errors for high-confidence predictions (like Texas or New York). I’ve added some random jitter to each line, so they don’t overlap so heavily. The grey bands trace the central 40% and 80% of the simulations. The thick black line is the average outcome.

This plot summarizes the number of mis-classifications you would expect from Nate Silver’s 538 blog, given his uncertainty estimates. A result that falls substantially above the gray bands would indicate too many mistakes, and too-optimistic a confidence model. Lines below the bands indicate not enough mistakes, or too pessimistic a model.

If we assume that the North Dakota and Montana senate races end up as upsets, here is Nate Silver’s performance:

Nate Silver’s actual performance, assuming ND and Montana senate races are Democrat upsets

He did, in fact, do slightly better than expected (I doubt he’ll lose sleep over that). This result is broadly consistent with what we should expect if Silver’s model is correct. On the other hand, consider what happens if he ends up correctly predicting these two senate races. It’s unlikely that Nate Silver should have predicted every race correctly, given his uncertainty estimates (this happens in about 2% of simulated elections). It’s possible that Silver will actually tighten up his uncertainty estimates next election.

In any event, I think he knows what he’s talking about. I’m reminded of this clip (Nate Silver is Jeff Goldblum; the rest of the world is Will Smith).