Expanded Methodology and Accuracy of NSD PF Rankings

This article was written by Inko Bovenzi, who maintains our PF rankings.

NSD’s PF rankings are, at their core, an Elo system for ranking teams. In an Elo system, after every round the winning team takes a certain number of points from the losing team based on the arithmetic difference in their ratings, such that beating a stronger team yields more points and beating a weaker team yields fewer. The number of points is determined by the following formula:

K is a semi-arbitrary constant that sets how quickly the rankings change, pd is the margin of victory, and winProb is the probability, computed before the round, that the eventual winner wins. It is calculated as follows:

The variable ed represents the winning team’s Elo minus the losing team’s Elo. Thus, if two teams have equivalent Elos, the probability that one beats the other is, per the formula, 50%: radical. More interestingly, the probability that a team 400 points stronger than its opponent wins is 90.1%. As a further sanity check, as ed approaches positive infinity, the fraction approaches 1: a team that is infinitely stronger than its opponent will win.
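As a rough illustration of the two formulas described above, here is a minimal Python sketch. It assumes the conventional Elo forms: a base-10 logistic curve with a scale of 400 for winProb, and a point transfer of K × pd × (1 − winProb). The exact constants NSD uses are not stated in this article, so the scale of 400 and the K value of 40 in the example are assumptions, as are the function names. With these conventional constants, a 400-point edge comes out near 0.909, in the same range as the figure quoted above.

```python
def win_probability(elo_diff: float, scale: float = 400.0) -> float:
    """Probability that the team ahead by `elo_diff` rating points wins.

    Uses the conventional logistic Elo curve (assumed, not confirmed by NSD).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / scale))


def points_exchanged(k: float, margin: float,
                     winner_elo: float, loser_elo: float) -> float:
    """Points the winner takes from the loser: K * pd * (1 - winProb)."""
    ed = winner_elo - loser_elo  # winner's Elo minus loser's Elo
    return k * margin * (1.0 - win_probability(ed))


if __name__ == "__main__":
    print(win_probability(0))              # evenly matched teams -> 0.5
    print(round(win_probability(400), 3))  # 400-point favorite -> 0.909
    # K = 40 is a hypothetical value chosen only for this example.
    print(points_exchanged(k=40, margin=1, winner_elo=1500, loser_elo=1500))  # -> 20.0
```

Because the transfer scales with (1 − winProb), an upset moves more points than an expected result: under these assumptions a 1400-rated team beating a 1600-rated team gains more than it would lose in the reverse outcome.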

NSD’s PF rankings are not merely a traditional Elo system, though. To make our rankings more accurate, we make some adjustments to the conventional rating scheme. Adjusting traditional Elo ratings is not unusual (538, for example, adjusts conventional Elo in its NCAA March Madness predictions). We believe the following adjustments each have sound theoretical justification:

First, we set the margin of victory to be 1 for all prelims, and higher for elims based on the decision, in order to reflect that, for example, a 7-0 decision is usually more decisive than a single prelim judge’s decision.

Decision | Margin of Victory Multiplier

Second, we reward participation in elimination rounds, both because breaking is an accomplishment in itself, and because it is a longstanding community norm to evaluate teams based on the extent of their participation in late elimination rounds, particularly at large tournaments. To this end, all elim losses result in a loss of only half the usual points, and all elim victories result in a bonus equal to the number of bids at the tournament divided by 2. For gold TOC, we took this adjusted number to be 12 (24 bids), and for silver TOC, 1 (2 bids).

Third and finally, all Elo shifts from winning or losing rounds, aside from the elim victory bonus, are weighted by the bid level of the tournament. This feature is important both because larger tournaments tend to have more reliable judging, and because teams tend to try harder (and thus more accurately reflect their abilities) when participating at a larger, more prestigious tournament.

These adjustments on the standard Elo rating system produce a formula for prelim round rating points change of:

A formula for elim victory points gain of:

And a formula for elim defeat points loss of:
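Taken together, the three adjustments can be sketched in Python as follows. This is an illustration under stated assumptions, not NSD’s actual implementation: it assumes the bid-level weight is a simple multiplier on the base Elo shift, that the elim victory bonus (bids divided by 2) is added unweighted, and that an elim loss costs half of the weighted shift. The function names, the `bid_weight` parameter, and every numeric value in the example are hypothetical.

```python
def win_probability(elo_diff: float, scale: float = 400.0) -> float:
    """Conventional logistic Elo curve (assumed form)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / scale))


def base_shift(k: float, margin: float, elo_diff: float, bid_weight: float) -> float:
    """Standard Elo shift, scaled by an assumed bid-level multiplier."""
    return k * margin * (1.0 - win_probability(elo_diff)) * bid_weight


def prelim_change(k: float, elo_diff: float, bid_weight: float) -> float:
    """Prelims: margin of victory is fixed at 1."""
    return base_shift(k, 1.0, elo_diff, bid_weight)


def elim_win_gain(k: float, margin: float, elo_diff: float,
                  bid_weight: float, bids: int) -> float:
    """Elim win: weighted shift plus an unweighted bonus of bids / 2."""
    return base_shift(k, margin, elo_diff, bid_weight) + bids / 2.0


def elim_loss(k: float, margin: float, elo_diff: float, bid_weight: float) -> float:
    """Elim loss: only half of the usual weighted shift is deducted."""
    return 0.5 * base_shift(k, margin, elo_diff, bid_weight)


if __name__ == "__main__":
    # Hypothetical inputs: K = 40, margin multiplier 1.5, evenly matched teams,
    # bid weight 1.0, and a 24-bid tournament (bonus of 12, as for gold TOC).
    print(elim_win_gain(40, 1.5, 0, 1.0, 24))  # -> 42.0
    print(elim_loss(40, 1.5, 0, 1.0))          # -> 15.0
```

Note that under these assumptions the system is no longer zero-sum in elims: the winner gains more than the loser gives up, which is how participation in elimination rounds gets rewarded.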

Are these rankings accurate? We can check the accuracy of an Elo system using Brier scores, a method for evaluating probabilistic predictions. Brier scores are calculated as follows:

$$BS = \frac{1}{n} \sum_{i=1}^{n} (p_i - o_i)^2$$

The variable n represents the number of data points, p_i the predicted probability of the first team winning round i, and o_i the actual result (1 or 0). This method of assessing rankings punishes both over- and underconfidence in predicted results.

NSD’s rankings had excellent Brier scores for gold TOC: in prelims, the rankings scored 0.187; in elims, 0.156; and overall, 0.185. It makes sense that the rankings predicted elims more accurately, since elims are themselves more predictable: judging is more reliable with more experienced panels of judges.

Brier Scores and Skill Scores | Prelims | Elims | Overall
Brier Score | 0.187 | 0.156 | 0.185
Brier Skill Score | 0.252 | 0.376 | 0.260

Note: Brier scores are measured from 0 to 1; lower scores are better. Brier skill scores are measured from -∞ to 1; higher scores are better.
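To make the two measures concrete, here is a minimal Python sketch of how a Brier score and a Brier skill score can be computed. The skill score compares a forecast against a reference forecast; the reference score of 0.25 used here (what one gets by always predicting 50%) is an assumption, though it does reproduce the skill scores reported in the table above. The function names are illustrative only.

```python
from typing import Sequence


def brier_score(predicted: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 results."""
    if len(predicted) != len(outcomes) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)


def brier_skill_score(score: float, reference: float = 0.25) -> float:
    """Improvement over a reference forecast: 1 - BS / BS_ref.

    The default reference of 0.25 corresponds to always predicting 50%.
    """
    return 1.0 - score / reference


if __name__ == "__main__":
    # A coin-flip forecaster scores 0.25 no matter what happens.
    print(brier_score([0.5, 0.5], [1, 0]))       # -> 0.25
    # Reported Brier scores from the table, converted to skill scores.
    print(round(brier_skill_score(0.187), 3))    # prelims -> 0.252
    print(round(brier_skill_score(0.156), 3))    # elims   -> 0.376
    print(round(brier_skill_score(0.185), 3))    # overall -> 0.26
```

A skill score of 0 means the forecast is no better than the reference, and a negative value means it is worse, which is why the scale has no lower bound.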

These Brier skill scores are quite good when one considers the inherent randomness in debate. Comparing them to 538’s Brier skill scores, we see that NSD’s rankings outperform roughly half of 538’s sports forecasts. 538’s politics forecasts outperform our rankings, but there is very little randomness in most political races compared to sports or debate, so predictive accuracy there will always be much higher. Democrats will (almost) always prevail in California, and Republicans will (almost) always win in Wyoming, so most predictions in political forecasts are already near certainties.

Ben Kessler