The Great Oscar Postmortem

Well, the Academy Awards are over (congrats to the winners; I know you’ve all been eagerly awaiting my approval) — how did my grand forecasting experiment turn out?

For some context, I’ll compare my forecast to some other statistically driven models: Nate Silver’s forecast, and the 3 other models covered along with mine in the Wall Street Journal. Briefly:

Nate Silver and Ben Zauzmer each developed a model largely similar to mine, making predictions from precursor awards ceremonies and weighting each ceremony by its historical accuracy. Not surprisingly, our predictions are very similar.

David Rothschild from Microsoft Research used a fancier / more mysterious combination of market data and crowdsourcing to make predictions. Most of his predictions are quoted at high confidence. He was also able to make predictions for all 24 categories (including the 3-4 categories where it’s hard to find relevant data from precursor ceremonies).

Farsite also seemed to aggregate information from a broader pool than precursor ceremonies. Their predictions tended to fall between the “precursor models” and the Microsoft model.

I wrote about my predictions in the major categories on this blog, and published my final set of predictions on a Google spreadsheet and NYTimes ballot before the ceremony. Unfortunately, some last-minute debugging changed one of my screenwriting predictions an hour before the ceremony (the revised prediction was correct, the original was wrong). Since I had previously published a different prediction, I will count that category as a mis-prediction here. Note also that I skipped the 3 short-subject categories and the Production Design category, as I didn’t have enough relevant data to make a prediction. I’ve excluded those categories from the following analysis.

Here are the results:

Top 6 Categories (Directing, Acting, Picture)

Beaumont: 5/6

Zauzmer: 5/6

Farsite: 5/6

Nate Silver: 4/6

Microsoft Research: 4/6

Only Farsite correctly predicted that Christoph Waltz would win for Best Supporting Actor. Ben and I lucked out when Ang Lee won Best Director (everyone else predicted a Spielberg win). We both claimed this category was essentially a toss-up, so we can’t claim to be too prescient here. However, our prediction seems more plausible than the Farsite/Microsoft predictions, which had Spielberg as a >3:1 favorite (again, though, this is only 1 data point).

Top 20 Categories (excluding Production Design and short subject categories)

Beaumont: 17/20

Zauzmer: 17/20

Microsoft: 17/20

Farsite/Silver: No Predictions

We each made 1-2 additional mistakes in the minor categories, albeit in different categories: I mis-called the Animated category and (thanks to the bug discussed above) mis-predicted Adapted Screenplay up until an hour before the awards. Ben missed Original Screenplay and Cinematography. Microsoft missed the Makeup Category.

Calibration

The raw accuracy is the most interesting statistic, but not the entire story. A good model should also be calibrated — a prediction made with X% confidence should be correct about X% of the time. Predictions substantially more or less accurate than this are mis-calibrated.

At first glance, it seems that the Microsoft model is best calibrated. The average prediction confidence in this model is about 80%, comparable to the overall accuracy. By contrast, Ben’s and my models make predictions at ~55% confidence on average. In other words, our models were too conservative, based on how well they did.
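To put a rough number on that gap, here is a minimal sketch. It assumes, purely for illustration, that every one of the 20 predictions was made at the model's average confidence and that categories are independent; neither is true of the real models. The helper name `prob_at_least` is made up for this example.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of at least k correct calls out of n, if each call is
    independently correct with probability p (a binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_categories, n_correct = 20, 17          # from the results above

# Approximate average confidences quoted in the text.
avg_confidence = {"Microsoft": 0.80, "Beaumont/Zauzmer": 0.55}

for name, p in avg_confidence.items():
    expected = n_categories * p           # expected number of correct calls
    tail = prob_at_least(n_correct, n_categories, p)
    print(f"{name}: expected {expected:.0f} correct, "
          f"P(>= {n_correct} correct) = {tail:.3f}")
```

Under those simplifying assumptions, 17 correct calls is unremarkable for a model averaging 80% confidence, but quite unlikely for one averaging 55%.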

In a few previous posts, I’ve discussed a method for visualizing model calibration. In essence, the idea is to use a model’s prediction confidence to simulate outcomes. For each simulation, we can plot the differences between the model prediction and simulated outcome. Finally, we overplot the actual outcome on top of these simulations. If the model is well-calibrated, the “reality line” should overlap the simulations. Let’s take a look at that:

Microsoft Model

Beaumont Model

Zauzmer Model

Each line in these plots shows how many mis-predictions were made at a given confidence level or greater (for example, the Microsoft model made 3 mistakes at confidence > 0%, 2 mistakes at confidence > 50%, 1 mistake at confidence > 70%, and no mistakes at confidence > ~75%). The red lines show 1000 simulations, the black line is the average simulation, the light/dark bands are the central 40%/80% of simulations, and the blue line is the actual performance.
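For anyone who wants to reproduce this kind of plot, here is a rough sketch of the simulation behind it. It is not the code that made the figures: the confidences are invented, the function names are hypothetical, and the plotting itself is left out.

```python
import random

def mistakes_at_or_above(confidences, correct, threshold):
    """Count wrong predictions made with confidence >= threshold."""
    return sum(1 for c, ok in zip(confidences, correct)
               if c >= threshold and not ok)

def simulate_curves(confidences, thresholds, n_sims=1000, seed=0):
    """Simulate outcomes in which each prediction is right with probability
    equal to its stated confidence; return one mistake curve per simulation."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_sims):
        correct = [rng.random() < c for c in confidences]
        curves.append([mistakes_at_or_above(confidences, correct, t)
                       for t in thresholds])
    return curves

def band(curves, lo_frac, hi_frac):
    """Per-threshold range between two quantiles of the simulated curves."""
    lo, hi = [], []
    for column in zip(*curves):
        ordered = sorted(column)
        lo.append(ordered[int(lo_frac * (len(ordered) - 1))])
        hi.append(ordered[int(hi_frac * (len(ordered) - 1))])
    return lo, hi

# Hypothetical confidences for eight categories (not any model's real numbers).
confidences = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4]
thresholds = [0.0, 0.25, 0.5, 0.75]

curves = simulate_curves(confidences, thresholds)
lo, hi = band(curves, 0.10, 0.90)        # central 80% of simulations
print("central 80% band:", list(zip(lo, hi)))
```

The actual curve comes from calling `mistakes_at_or_above` with the real outcomes; a calibrated model's curve should sit inside the simulated band at most thresholds.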

These plots confirm and quantify what I said above: the models that Ben and I put together are too conservative, while the Microsoft model seems nicely calibrated. I have to admit I am both impressed by how well the “magic” Microsoft model calibrated itself, and curious about the under-confidence of my own model. I’ll be chewing on that in the coming days. Here are a few things I’ll be thinking about:

  • My predictions did better than I expected, based on testing on historical data. I was expecting to miss ~5 categories. After catching my screenwriting bug, I only missed 2. That’s mild evidence that this year’s Oscars played out more like the precursor awards than usual, which explains some of the under-confidence.
  • I used regularized regression to optimize my model; the “regularization” means the overall confidence of the model is adjusted up or down to match historical data. I’m starting to wonder if there’s any asymmetry in that process such that, given the relative simplicity/inflexibility of my model, the regularization prefers under-confidence. Who knows.

So… yay math?

So at the end of the day, is this worth it (I’m asking in the sense of ‘is there a significant edge to model-based predictions’, and not ‘is this a waste of time’)? Certainly, basing predictions off of precursor awards is a huge advantage over ignoring the data: my Oscar guesses from previous years were usually <~50% accurate, not 85%.

What about the harder question of whether modeling is better than the naive strategy of looking at the precursor awards, and predicting whichever film has won the most awards? That simple strategy largely yields the same set of predictions. The few categories where modeling matters are close calls, where there is no obvious nominee with a plurality of precursor wins. The best example from this year was the Best Actress Category; both Jennifer Lawrence and Jessica Chastain won precursor awards. However, the mathematical models all realized that Jennifer Lawrence was more successful in the more influential awards (e.g. SAG), and correctly predicted her as a clear favorite.
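Here is a toy illustration of where the two strategies part ways. The nominees, ceremonies, and weights are all hypothetical placeholders; in a fitted model the weights would come from historical data rather than being typed in by hand.

```python
from collections import Counter

def plurality_pick(wins):
    """Naive strategy: pick the nominee with the most precursor wins.
    `wins` maps ceremony name -> winning nominee. Returns None on a tie."""
    counts = Counter(wins.values()).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def weighted_pick(wins, weights):
    """Weight each ceremony's winner by how predictive that ceremony is."""
    scores = Counter()
    for ceremony, nominee in wins.items():
        scores[nominee] += weights.get(ceremony, 0.0)
    return scores.most_common(1)[0][0]

# Hypothetical close call: two nominees split four precursor awards.
wins = {"Ceremony 1": "Nominee A", "Ceremony 2": "Nominee B",
        "Ceremony 3": "Nominee A", "Ceremony 4": "Nominee B"}
# Hypothetical weights standing in for each ceremony's influence.
weights = {"Ceremony 1": 0.2, "Ceremony 2": 1.5,
           "Ceremony 3": 0.3, "Ceremony 4": 0.4}

print(plurality_pick(wins))            # tie, so the naive rule cannot decide
print(weighted_pick(wins, weights))    # the more influential wins break the tie
```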

Anyways, that’s a rather large brain dump. This was fun. And I FINALLY eked out an Oscar pool victory against my brother. Mission accomplished. Next year, I’m going after Microsoft.


Forecasting the Oscars Like a Boss

For the past eight-ish years, I’ve participated in an Oscar prediction pool with my brothers and some of my friends. We’ve fooled around with a number of different scoring schemes. The version we’ve settled on the past few years has been to sort each prediction by how confident we are, and assign point values to each category accordingly. This seems to keep the contest more interesting longer through the awards broadcast.

These contests have typically resulted in my shame. It turns out that I am terrible at guessing which movies are likely to win Oscars (never mind the fact that I have NO FREAKING IDEA what the difference between sound editing and sound mixing is, but apparently they each need their own categories. Whatever.). My brother Jon has won handily the past 3 years, and that has to stop.

This year, I resolved to systematize my predictions, by building a model to forecast Oscar results. Why do this, you ask?

1) The lucrative $50 prize, and all of the opportunities that that opens up.

2) I’ve been looking for a non-astrophysics data analysis project, and this seemed fun. Also, astrophysicists throw garbage at you when you try to do inference without a physically-justified model, and I wanted to slum it up a bit with some purely-empirical forecasting.

3) I’ve been wanting to spend more time with the Pandas and Scikit-Learn Python libraries.

The Data

The Academy Awards is the last ceremony in a long awards season. The most obvious data to use to try to forecast the Oscars are the results from these previous ceremonies. This is especially true since the same people vote for multiple ceremonies, so these awards can be seen as polls for the Oscars. There are other potentially interesting variables to consider (Rotten Tomatoes rating, box office performance, genre or cast information, etc), but I decided to start with the ceremony results.

The IMDB archives the nominees and winners for the major ceremonies. They don’t provide a nice API for grabbing their data, but it’s easy enough to parse from the HTML (enter: BeautifulSoup). As is so often the case, data cleaning is among the most time-consuming tasks of the project. For example, many award categories change names slightly over time (Best Picture -> Best Motion Picture of the Year). After standardizing all of this information, I ended up with a JSON database of all the nominees and winners for 7 awards ceremonies since 1990 (about 8800 nominations in total).
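As a small taste of the cleaning step, here is a sketch of how category names might be standardized. The alias table and function name are hypothetical; only the Best Picture variant comes from the example above, and the scraping itself is not shown.

```python
import re

# Hypothetical alias table: canonical category -> patterns for its variants.
# A real table would need an entry for every category and ceremony.
CATEGORY_ALIASES = {
    "Best Picture": [r"^best (motion )?picture( of the year)?$"],
}

def standardize_category(raw_name):
    """Return the canonical category for a raw name, or the cleaned name
    unchanged if no alias matches."""
    cleaned = " ".join(raw_name.split())      # collapse stray whitespace
    for canonical, patterns in CATEGORY_ALIASES.items():
        if any(re.match(p, cleaned, flags=re.IGNORECASE) for p in patterns):
            return canonical
    return cleaned

print(standardize_category("Best Motion Picture of the Year"))
print(standardize_category("  Best   Picture "))
```

Returning unmatched names unchanged makes it easy to list whatever still needs an alias.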

Exploration

Before doing anything fancy, I wanted to get a grasp on what the data looked like. To be concrete, I’ll focus on the Best Picture category for the moment. Here’s a plot of what fraction of movies go on to win the Best Picture Oscar, as a function of whether they were nominated for or won the award in a different ceremony.

Correlation between best picture nominees/winners for the Oscars and other ceremonies

Interestingly (though not all that surprising in retrospect), winning the Independent Spirit Award (which focuses on indie cinema) is anti-correlated with winning the Oscar. The only movie since 1990 to win both the Independent Spirit and Oscar Best Picture awards was The Artist last year. Also interesting is the fact that the Golden Globes correlates so weakly with the Oscars — this ceremony is often touted as being a good predictor for the Academy Awards, but other ceremonies clearly do better.
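The quantity in that plot is simple to compute once the data is in a uniform shape. The sketch below assumes a hypothetical record layout (one dict per film) and made-up films; here "nominated" means nominated without winning, which is a choice made for this example.

```python
from collections import defaultdict

def oscar_win_fraction(records, ceremony):
    """Fraction of films that won the Oscar, split by their result
    ('nominated' or 'won') at another ceremony."""
    tallies = defaultdict(lambda: [0, 0])     # status -> [oscar wins, films]
    for rec in records:
        status = rec["results"].get(ceremony)
        if status is None:                    # not nominated there at all
            continue
        tallies[status][0] += rec["won_oscar"]
        tallies[status][1] += 1
    return {status: wins / total for status, (wins, total) in tallies.items()}

# Hypothetical records, not real results.
records = [
    {"film": "Film A", "won_oscar": True,  "results": {"Ceremony X": "won"}},
    {"film": "Film B", "won_oscar": False, "results": {"Ceremony X": "nominated"}},
    {"film": "Film C", "won_oscar": False, "results": {"Ceremony X": "won"}},
    {"film": "Film D", "won_oscar": False, "results": {}},
]

fractions = oscar_win_fraction(records, "Ceremony X")
print(fractions)
```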

Modeling

How do I combine all of this information into an estimate of who is most likely to win the 2013 Best Picture award? Equally important, how do I assess how confident this prediction is, so that I can wager more points on the categories I’m most certain about?

One of the standard strategies for estimating success probabilities based on a 0/1 outcome (a movie either wins, or it doesn’t) is logistic regression. It’s pretty much the simplest thing to try, so it’s worth checking how well it does.

There are two wrinkles to this model:

  1. It doesn’t account for the fact that exactly one nominee in each category wins. I address this by re-normalizing the probabilities in the model within each category, and adjusting the likelihood calculation of the data given the model accordingly. This makes finding the optimal model slightly harder, but the dataset is small enough that computing power isn’t much of a concern.
  2. With only 23 years of historical data, over-fitting is a danger. To address this, I ran a regularized regression. That is, instead of fitting the model by maximizing the likelihood, I maximize a modified likelihood that penalizes Logistic Regression models with large coefficients. The size of coefficients in a Logistic Regression model directly relates to how confident predictions are, so the penalty acts to make the model more conservative, and less likely to draw too-strong a conclusion from a small training dataset. The strength of the penalty is chosen by cross-validation.
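The two wrinkles can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the fitted model: the penalty is taken to be L2 (the text only says large coefficients are penalized), there is no intercept, the features are invented, and the optimizer and cross-validation loop are left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def category_probabilities(features, coefs):
    """Logistic score for each nominee, re-normalized so the probabilities
    within one category sum to 1 (exactly one nominee wins)."""
    raw = [sigmoid(sum(w * x for w, x in zip(coefs, row))) for row in features]
    total = sum(raw)
    return [p / total for p in raw]

def penalized_neg_log_likelihood(categories, coefs, penalty):
    """Negative log-likelihood of the observed winners plus an L2 penalty
    on the coefficients (the L2 form is an assumption of this sketch)."""
    nll = 0.0
    for features, winner_index in categories:
        probs = category_probabilities(features, coefs)
        nll -= math.log(probs[winner_index])
    return nll + penalty * sum(w * w for w in coefs)

# Hypothetical category: 3 nominees, 2 features each
# (1 if the nominee won precursor ceremony 1 / ceremony 2, else 0).
category = ([[1, 0], [0, 1], [0, 0]], 0)      # nominee 0 won the Oscar

for coefs in ([0.5, 0.2], [3.0, 1.2]):
    probs = category_probabilities(category[0], coefs)
    print(coefs, [round(p, 2) for p in probs])
```

The larger coefficients push more probability onto the front-runner, which is the behavior the penalty reins in. Minimizing `penalized_neg_log_likelihood` over `coefs` for several values of `penalty`, and scoring each on held-out years, is one way the cross-validation step could be carried out.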

The scikit-learn library is really wonderful for this kind of work. First, they provide a lot of functionality out-of-the-box (optimization, cross validation, and implementations of dozens of models). Furthermore, the API is extremely consistent, so that you can build your own custom classifiers, using scikit-learn objects as building blocks. I definitely plan on using it more, even for more vanilla model fitting and optimization tasks (its API is way better than most of SciPy in my opinion).

There are a number of criteria to evaluate whether this model is a good fit to the data.

How accurate is it?

This model correctly classifies about 75% of the best picture winners since 1990. Furthermore, the years it fails usually correspond to notable upsets; for example, Crash unexpectedly won Best Picture in 2006, despite Brokeback Mountain being a strong favorite. Brokeback Mountain won best picture in every other ceremony in the database, and Crash is the only movie that won the Oscar without even being nominated for the Golden Globe. Other notable upsets include 1999 (when Shakespeare in Love beat Saving Private Ryan) and 1996 (Braveheart won, the model predicted Apollo 13). I see these upsets as indicative of the inherent uncertainty in trying to predict Oscar winners based on other ceremonies.

How representative is the data?

One nice property about the model is that it is generative — you can use it to simulate hypothetical outcomes for each year, based on the information provided from the other ceremonies. If the model is a fair representation of the data, then the actual outcome should look like these simulated outcomes (if it doesn’t, this suggests an over-fit or mis-specified model).

Monte-Carlo Simulation from the Best Picture Forecast Model

The left plot shows, for each of 1000 hypothetical results of 23 years of best picture awards, how many years were correctly predicted by the model. The black line is the actual data, and falls near the typical value of this distribution. That’s encouraging.
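A bare-bones version of the simulation behind the left plot might look like the following. The yearly probabilities are hypothetical stand-ins for the model's output, and the function name is made up.

```python
import random

def simulate_years_correct(yearly_probs, n_sims=1000, seed=0):
    """For each simulation, draw one winner per year from the model's
    probabilities and count the years where the model's top pick wins."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        n_correct = 0
        for probs in yearly_probs:
            pick = max(range(len(probs)), key=probs.__getitem__)
            winner = rng.choices(range(len(probs)), weights=probs)[0]
            n_correct += (winner == pick)
        counts.append(n_correct)
    return counts

# Hypothetical forecast: 23 years, five nominees each, same odds every year.
yearly_probs = [[0.6, 0.2, 0.1, 0.05, 0.05]] * 23

counts = simulate_years_correct(yearly_probs)
print("mean years correct:", sum(counts) / len(counts))
```

A histogram of `counts` gives the simulated distribution; the real number of correctly called years is then compared against it.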

The right plot is slightly more discerning. It plots, as a function of the confidence at which each prediction is made, how many mistakes were made at confidences below this threshold. If these confidences are right, then the model should make most of its mistakes at lower confidences; that is, the curve should rise steeply on the left and then level off (I used a similar plot when talking about Nate Silver’s election forecast results). The black lines show 500 simulations, and the red line shows the actual data. Again, it’s encouraging that it runs through the collection of black curves.

This fairly simple model seems to do a pretty good job at characterizing the outcome of the Best Picture category for the past 2 decades. It’s possible that a better model (or additional information about each nominee) could make the predictions more precise (i.e. skew the blue histogram to the right, and make it narrower). I may look into this in the coming weeks. In any event, I’ll be able to make predictions about the 2013 Oscars once the other award ceremonies happen. Look for more blog posts.

I’m coming for you, Jon.