Forecasting the Oscars Like a Boss

For the past eight-ish years, I’ve participated in an Oscar prediction pool with my brothers and some of my friends. We’ve fooled around with a number of different scoring schemes. The version we’ve settled on the past few years has been to sort each prediction by how confident we are, and assign point values to each category accordingly. This seems to keep the contest more interesting longer through the awards broadcast.

These contests have typically resulted in my shame. It turns out that I am terrible at guessing which movies are likely to win Oscars (never mind the fact that I have NO FREAKING IDEA what the difference between sound editing and sound mixing is, but apparently they each need their own category. Whatever.). My brother Jon has won handily the past 3 years, and that has to stop.

This year, I resolved to systematize my predictions, by building a model to forecast Oscar results. Why do this, you ask?

1) The lucrative $50 prize, and all of the opportunities that that opens up.

2) I’ve been looking for a non-astrophysics data analysis project, and this seemed fun. Also, astrophysicists throw garbage at you when you try to do inference without a physically-justified model, and I wanted to slum it up a bit with some purely-empirical forecasting.

3) I’ve been wanting to spend more time with the Pandas and Scikit-Learn Python libraries.

The Data

The Academy Awards is the last ceremony in a long awards season. The most obvious data to use to try to forecast the Oscars are the results from these previous ceremonies. This is especially true since the same people vote for multiple ceremonies, so these awards can be seen as polls for the Oscars. There are other potentially interesting variables to consider (Rotten Tomatoes rating, box office performance, genre or cast information, etc.), but I decided to start with the ceremony results.

IMDb archives the nominees and winners for the major ceremonies. They don’t provide a nice API for grabbing their data, but it’s easy enough to parse from the HTML (enter: BeautifulSoup). As is so often the case, data cleaning was among the most time-consuming tasks of the project. For example, many award categories change names slightly over time (Best Picture -> Best Motion Picture of the Year). After standardizing all of this information, I ended up with a JSON database of all the nominees and winners for 7 awards ceremonies since 1990 (about 8800 nominations in total).
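To give a flavor of the cleaning step, here is a minimal sketch of how drifting category names could be folded onto one canonical label. The `CATEGORY_ALIASES` table and the `normalize_category` helper are hypothetical names made up for illustration; the real mapping would be assembled by hand while inspecting the scraped data.

```python
import re

# Hypothetical alias table: keys are lowercased, whitespace-collapsed labels
# as they might appear in the scraped pages; values are the canonical names.
CATEGORY_ALIASES = {
    "best picture": "Best Picture",
    "best motion picture of the year": "Best Picture",
}

def normalize_category(raw):
    """Map a scraped category label onto one canonical name.

    Labels without a known alias are returned unchanged (minus outer spaces),
    so unmapped categories stay visible instead of being silently merged.
    """
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return CATEGORY_ALIASES.get(key, raw.strip())
```

Returning unknown labels untouched makes it easy to list whatever still needs a manual alias.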


Before doing anything fancy, I wanted to get a grasp on what the data looked like. To be concrete, I’ll focus on the Best Picture category for the moment. Here’s a plot of what fraction of movies go on to win the Best Picture Oscar, as a function of whether they were nominated for or won the award in a different ceremony.

Correlation between best picture nominees/winners for the Oscars and other ceremonies


Interestingly (though not all that surprisingly in retrospect), winning the Independent Spirit Award (which focuses on indie cinema) is anti-correlated with winning the Oscar. The only movie since 1990 to win both the Independent Spirit and Oscar Best Picture awards was The Artist last year. Also interesting is how weakly the Golden Globes correlate with the Oscars. This ceremony is often touted as being a good predictor for the Academy Awards, but other ceremonies clearly do better.
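The quantity in that plot is just a conditional win rate. A rough sketch of the bookkeeping, assuming each film is stored as a dict with an `oscar_win` flag and one key per ceremony holding `"won"`, `"nominated"`, or `"absent"` (this record layout is my own invention for illustration, not the actual JSON schema):

```python
from collections import defaultdict

def oscar_win_rate(records, ceremony):
    """Fraction of films that won the Oscar, grouped by status at `ceremony`.

    Each record is a dict with a boolean "oscar_win" and, under the ceremony's
    key, one of "won", "nominated", or "absent" (hypothetical layout).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        status = rec[ceremony]
        totals[status] += 1
        wins[status] += rec["oscar_win"]  # True counts as 1
    return {status: wins[status] / totals[status] for status in totals}
```

Calling this once per ceremony gives one group of bars per ceremony.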


How do I combine all of this information into an estimate of who is most likely to win the 2013 Best Picture award? Equally important, how do I assess how confident this prediction is, so that I can wager more points on the categories I’m most certain about?

One of the standard strategies for estimating success probabilities based on a 0/1 outcome (a movie either wins, or it doesn’t) is logistic regression. It’s pretty much the simplest thing to try, so it’s worth checking how well it does.

There are two wrinkles to this model:

  1. It doesn’t account for the fact that exactly one nominee in each category wins. I address this by re-normalizing the probabilities in the model within each category, and adjusting the likelihood calculation of the data given the model accordingly. This makes finding the optimal model slightly harder, but the dataset is small enough that computing power isn’t much of a concern.
  2. With only 23 years of historical data, over-fitting is a danger. To address this, I ran a regularized regression. That is, instead of fitting the model by maximizing the likelihood, I maximize a modified likelihood that penalizes logistic regression models with large coefficients. The size of the coefficients in a logistic regression model directly relates to how confident its predictions are, so the penalty acts to make the model more conservative, and less likely to draw too strong a conclusion from a small training dataset. The strength of the penalty is chosen by cross-validation.
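Both wrinkles fit in a few lines. The sketch below is a standard-library toy, not the scikit-learn-based code used for the post: it assumes a squared-coefficient (L2) penalty, has no intercept term (append a constant feature if you want one), and fits by plain gradient ascent with a fixed step. The function names and the `penalty`, `lr`, and `steps` parameters are all made up for illustration, and the cross-validation loop that picks the penalty is left out.

```python
import math

def raw_scores(weights, group):
    """Plain logistic probabilities for each nominee in one category-year."""
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(weights, feats))))
            for feats in group]

def group_probs(weights, group):
    """Logistic probabilities rescaled so the nominees sum to 1."""
    raw = raw_scores(weights, group)
    total = sum(raw)
    return [r / total for r in raw]

def fit(groups, winners, penalty, lr=0.05, steps=2000):
    """Gradient ascent on sum(log p_winner) - penalty * sum(w**2).

    groups  : list of category-years, each a list of feature lists
    winners : index of the actual winner within each group
    """
    n_features = len(groups[0][0])
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [-2.0 * penalty * wk for wk in w]  # gradient of the penalty
        for group, win in zip(groups, winners):
            raw = raw_scores(w, group)
            total = sum(raw)
            for k in range(n_features):
                # Derivative of log(raw[win]) - log(total) with respect to w[k].
                grad[k] += (1.0 - raw[win]) * group[win][k]
                grad[k] -= sum(r * (1.0 - r) * f[k]
                               for r, f in zip(raw, group)) / total
        w = [wk + lr * g for wk, g in zip(w, grad)]
    return w
```

A larger `penalty` pulls the coefficients toward zero, which pushes every nominee toward an equal share of the probability. That is the "more conservative" behavior described above.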

The scikit-learn library is really wonderful for this kind of work. First, they provide a lot of functionality out-of-the-box (optimization, cross validation, and implementations of dozens of models). Furthermore, the API is extremely consistent, so that you can build your own custom classifiers, using scikit-learn objects as building blocks. I definitely plan on using it more, even for more vanilla model fitting and optimization tasks (its API is way better than most of SciPy in my opinion).

There are a number of criteria to evaluate whether this model is a good fit to the data.

How accurate is it?

This model correctly classifies about 75% of the best picture winners since 1990. Furthermore, the years it fails usually correspond to notable upsets; for example, Crash unexpectedly won Best Picture in 2006, despite Brokeback Mountain being a strong favorite. Brokeback Mountain won best picture in every other ceremony in the database, and Crash is the only movie that won the Oscar without even being nominated for the Golden Globe. Other notable upsets include 1999 (when Shakespeare in Love beat Saving Private Ryan) and 1996 (Braveheart won, the model predicted Apollo 13). I see these upsets as indicative of the inherent uncertainty in trying to predict Oscar winners based on other ceremonies.

How representative is the data?

One nice property of the model is that it is generative: you can use it to simulate hypothetical outcomes for each year, based on the information provided from the other ceremonies. If the model is a fair representation of the data, then the actual outcome should look like these simulated outcomes (if it doesn’t, this suggests an over-fit or mis-specified model).

Monte-Carlo Simulation from the Best Picture Forecast Model


The left plot shows, for each of 1000 hypothetical results of 23 years of best picture awards, how many years were correctly predicted by the model. The black line is the actual data, and falls near the typical value of this distribution. That’s encouraging.
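The simulation behind the left plot can be sketched roughly as follows, assuming the fitted model has already been boiled down to one list of win probabilities per year. The function name, the `yearly_probs` argument, and the fixed seed are illustrative choices, not the code that made the figure.

```python
import random

def simulate_correct_counts(yearly_probs, n_sims=1000, seed=0):
    """For each simulated history, count years where the model's favorite wins.

    yearly_probs : one list of nominee win probabilities per year
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        correct = 0
        for probs in yearly_probs:
            nominees = range(len(probs))
            favorite = max(nominees, key=probs.__getitem__)
            # Draw a hypothetical winner according to the model's probabilities.
            winner = rng.choices(nominees, weights=probs)[0]
            correct += (winner == favorite)
        counts.append(correct)
    return counts
```

A histogram of the returned counts is the distribution that the observed number of correct years gets compared against.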

The right plot is slightly more discerning. It plots, as a function of the confidence at which each prediction is made, how many mistakes were made which had confidences less than this threshold. If these confidences are right, then the model should make most of its mistakes at lower confidences. That is, the curve should rise steeply on the left and then level off (I used a similar plot when talking about Nate Silver’s election forecast results). The black lines show 500 simulations, and the red line shows the actual data. Again, it’s encouraging that it runs through the collection of black curves.
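One curve in that plot amounts to sorting the predictions by confidence and keeping a running tally of the misses. A small sketch, where `cumulative_mistakes` and its two parallel input lists are hypothetical names, and where a mistake is counted at its own confidence value (an "at or below" convention I picked for simplicity):

```python
def cumulative_mistakes(confidences, correct):
    """Return (confidence, mistakes at or below that confidence) pairs.

    confidences : the model's probability for its predicted winner, per year
    correct     : whether that prediction turned out right, per year
    """
    curve = []
    mistakes = 0
    for conf, ok in sorted(zip(confidences, correct)):
        mistakes += (not ok)
        curve.append((conf, mistakes))
    return curve
```

Running the same function on each simulated history gives the family of curves to compare against the real one.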

This fairly simple model seems to do a pretty good job at characterizing the outcome of the Best Picture category for the past 2 decades. It’s possible that a better model (or additional information about each nominee) could make the predictions more precise (i.e. skew the blue histogram to the right, and make it narrower). I may look into this in the coming weeks. In any event, I’ll be able to make predictions about the 2013 Oscars once the other award ceremonies happen. Look for more blog posts.

I’m coming for you, Jon.

The How of Hurricane Response

After a hurricane or other natural disaster, debris poses a huge problem: it knocks out vital infrastructure like electricity and blocks roads, inhibiting the work of rescue crews and cutting off victims from access to hospitals, supplies, and evacuation routes. It is vitally important that disaster relief efforts have plans for using their limited resources to efficiently clear debris off roads, giving aid to as many people as possible, as quickly as possible.

This was the scenario the Harvard Institute for Applied Computational Sciences presented to two teams of graduate students last week, as part of their first computational challenge. They were given digitized road maps of Cambridge, MA, information about the population density, and a realistic projection of the road debris that would be left behind after a major hurricane. Each team was given two weeks to design an algorithm to efficiently clear debris, minimizing the amount of time people are cut off from access to local hospitals. I was a member of one of these teams.

Within this scenario, we have enough resources to clear a limited amount of debris each day. All of the bulldozers start off at two local hospitals (admittedly somewhat contrived: in a real scenario, relief crews would likely work their way in from outside the disaster area). At any given time, we can only clear debris on the roads immediately adjacent to roads that we have already cleared. That is, we can’t magically airdrop bulldozers into the most heavily damaged areas. Instead, we have to clear our way to these areas.

“Solutions” to the problem consist of a schedule of which roads to clear, in which order. Whichever solution gives the most people fast access to hospitals wins.

Coming up with a Solution

In principle, this problem can be solved very easily: just consider every possible schedule for clearing the roads, and choose the one which restores hospital access most quickly. Unfortunately, this approach is utterly infeasible. Our map of Cambridge has 604 road segments, yielding about 604! or 10^{1420} options (that’s a 1 with 1420 zeros after it). Even on the fastest computers in the world (or the future, really), this calculation would take far longer than the current age of the universe to complete. We need a more intelligent way of searching through possible solutions.
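If you want to check the size of that number without building a 1400-digit integer, the log-gamma function does the job. This is only a sanity check on the order of magnitude, and `log10_factorial` is a helper name I made up:

```python
import math

def log10_factorial(n):
    """log10(n!) via the log-gamma function, since lgamma(n + 1) = ln(n!)."""
    return math.lgamma(n + 1) / math.log(10)

# The base-10 exponent of 604!
exponent = log10_factorial(604)
```

The exponent comes out between 1419 and 1420, in line with the rough figure quoted above.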

The approach we came up with turns out to be highly effective. It involves altering a given schedule to generate a new, similar, and possibly better solution. We called this “nudging” the schedule, and it works as follows:

  1. Take an initial schedule that solves the problem, but in a non-optimal way (these are easy to come up with).
  2. Truncate the schedule at some point, keeping only the first N decisions.
  3. Determine which points in the city do not have access to a hospital after these steps. Choose one at random.
  4. Clear out the most efficient path (the minimum-weight path, in graph-algorithm jargon) from one of the hospitals to this location. Add these decisions immediately after the partial schedule from step 2.
  5. Add the rest of the decisions that were truncated in step 2 (paying careful attention to not duplicate any work that you did in step 4).

Nudging solutions has a lot of nice properties: it’s fast (we can easily do it ~100 times per second with our relatively slow Python code), and it provides a way to re-prioritize a schedule, since the location chosen in step 3 is rescued earlier in the new plan than it was in the old. It isn’t too hard to convince yourself that, with enough nudging, it is possible to arrive at the globally optimal solution from any starting solution. And the choice of using the most efficient path in step 4 tends to create effective schedules which don’t waste resources.
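For the curious, the five nudging steps can be sketched on a toy road network. This is a from-scratch illustration rather than our competition code. It assumes the map is a dict of dicts (`graph[u][v]` is the debris on the road between intersections `u` and `v`), that a schedule is a list of roads in clearing order, and that "most efficient" means least remaining debris, found with Dijkstra's algorithm. All function names are invented for the example.

```python
import heapq
import random

def edge(u, v):
    """Canonical (order-independent) key for the road between u and v."""
    return tuple(sorted((u, v)))

def reachable(graph, hospitals, cleared):
    """Nodes connected to any hospital using only cleared roads."""
    seen, stack = set(hospitals), list(hospitals)
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in seen and edge(u, v) in cleared:
                seen.add(v)
                stack.append(v)
    return seen

def cheapest_path(graph, hospitals, cleared, target):
    """Dijkstra from all hospitals; cleared roads cost 0, others their debris."""
    dist = {h: 0.0 for h in hospitals}
    prev = {}
    heap = [(0.0, h) for h in hospitals]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            break
        for v, debris in graph[u].items():
            nd = d + (0.0 if edge(u, v) in cleared else debris)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], target
    while node in prev:
        path.append(edge(prev[node], node))
        node = prev[node]
    return path[::-1]  # ordered from the hospital outward

def nudge(schedule, graph, hospitals, cut, rng=random):
    """Return a new schedule that rescues one random stranded node earlier."""
    prefix = list(schedule[:cut])                                # step 2
    cleared = set(prefix)
    stranded = sorted(set(graph) - reachable(graph, hospitals, cleared))
    if not stranded:
        return list(schedule)
    target = rng.choice(stranded)                                # step 3
    for road in cheapest_path(graph, hospitals, cleared, target):  # step 4
        if road not in cleared:
            prefix.append(road)
            cleared.add(road)
    # Step 5: append the truncated decisions, skipping roads already cleared.
    return prefix + [road for road in schedule[cut:] if road not in cleared]
```

Because the inserted path is ordered from a hospital outward and the leftover decisions keep their relative order, every road in the new schedule still touches something cleared before it, so the nudged schedule respects the adjacency rule whenever the original did.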

Equipped with a strategy of nudging solutions, there are many algorithms which can search for the best strategy. We chose Simulated Annealing, which works more or less as follows:

  1. Start with a schedule S1
  2. Nudge S1 to generate a new schedule, S2
  3. If S2 is a better schedule, throw away S1
  4. If it is worse, then throw away S2 with some probability related to how much worse of a solution it is
  5. Repeat the process with the solution that wasn’t discarded

The probability of keeping a worse schedule in step 4 is gradually lowered throughout the process. At the beginning, almost all solutions are accepted, allowing the algorithm to explore a wide range of possible scenarios. As the probability drops, the search is gradually confined to better and better solutions.
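Here is a generic sketch of that loop. The acceptance rule (keep a worse candidate with probability exp(-increase / temperature)) and the geometric cooling schedule are common textbook choices that I am assuming for illustration; they are not necessarily the exact settings we ran. `anneal`, `t_start`, `t_end`, and the callback signatures are likewise made-up names.

```python
import math
import random

def anneal(initial, nudge, penalty, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Minimize `penalty` by repeatedly nudging the current solution.

    nudge(solution, rng) -> a new, similar solution
    penalty(solution)    -> a number, lower is better
    """
    rng = random.Random(seed)
    current, current_cost = initial, penalty(initial)
    best, best_cost = current, current_cost
    for i in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        temp = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        candidate = nudge(current, rng)
        cost = penalty(candidate)
        delta = cost - current_cost
        # Always keep improvements; keep worse candidates with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost
```

For example, `anneal(20, lambda x, rng: x + rng.choice([-1, 1]), lambda x: (x - 3) ** 2)` walks an integer toward the minimum of a simple quadratic. For the hurricane problem, `nudge` would be the schedule-nudging step and `penalty` the total time residents spend without hospital access.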

The algorithmic showdown

After two weeks of development, each team applied their algorithm to a slightly modified map of Cambridge. We were given 3 hours of computing time on one of Harvard’s supercomputers to run our algorithm and come up with the best possible strategy.

Our solution turned out to be very effective. After only a few minutes of nudging, we had found a solution better than the competing team’s final answer. Furthermore, our strategy out-performed the solution generated by the competition’s organizers from Georgia Tech, which was previously thought to be near-optimal.

How nudging solutions with simulated annealing decreases the penalty function, as a function of time. Each black line depicts a series of nudges as a function of computation time. The penalty function relates to how long each resident is stranded without hospital access (lower numbers are better). The penalty function corresponding to the organizers' solution is drawn in green. The red line depicts a strict lower limit to the penalty -- no solution can be better than this, given the amount of debris on the roads.

We can put our performance in more concrete terms: with our strategy, the average resident is stranded without hospital access for 2 days and 18 hours. The previous ‘optimal’ strategy kept the average resident waiting for 2 days and 21 hours, and naive strategies (e.g. clearing off roads at random) will keep residents waiting for over 4 days on average. These extra hours would cost many lives, since common post-disaster health problems like dehydration and cholera progress on the timescale of hours to days.

I’m pretty satisfied with our work these past few weeks. This approach isn’t too different from some of the data analysis tasks I tackle within astrophysics, and it was great to see these same techniques successfully handle a problem with real humanitarian benefit. I hope our solution gets taken under consideration in future disaster relief research — to encourage this, I’ve posted our code online.