Are Judges Just Guessing? A Statistical Analysis of LD Elimination Round Panels by Steven Adler

Debate judges often bemoan how difficult rounds are to decide: Debaters never seem to do quite enough weighing or issue comparison; evidence is frequently over-tagged; and speakers are often muddled in their delivery, even during important rebuttals. Given these concerns, one might fairly ask – can judges really tell who is winning in a debate round, or are they often grasping at straws?

This question has important implications for our beliefs in the competitive merit of the activity, as well as for possible norms going forward. National circuit LD’s high speeds, dense philosophy, and theoretical complexity are often defended as educational because they allow a more thorough discussion and evaluation of the topic—but what good are these values if even the judge cannot parse them? Meanwhile, debaters generally believe that investing hard work into debate will increase their competitive ability, and that wins in debate rounds are generally awarded on the basis of merit—but a ‘random’ decision process by judges might suggest an upper bound on how often wins are truly deserved. As such, it is important for LD to consider whether judges can distinguish a round’s true winner.

Before dealing with these potential implications, however, we must first handle the actual data. This question is fundamentally one of statistical independence—whether the ballots cast by judges in elimination rounds can be modeled as separate random draws from a probability distribution, or whether judges agree more often than would be predicted by individual random draws. Thankfully, statistical tests exist that would allow us to test such a hypothesis.

More specifically, one might construct an ‘expected’ distribution of round outcomes (# of 3-0 Affs, 2-1 Affs, 2-1 Negs, 3-0 Negs) if judges were randomly voting over some probability distribution, and then use a Chi-square independence test to determine whether the observed distribution of outcomes is statistically significantly different from what you would see if judges were randomizing. If so, then this could be evidence either that judges can discern a winning debater (if the number of 3-0 decisions is overrepresented), or that they are particularly bad at discerning a winner (if the number of 2-1 decisions is overrepresented).
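To make the model concrete: if each judge independently votes Neg with some probability x, then a panel produces a 3-0 Aff with probability (1-x)^3, a 2-1 Aff with probability 3*x*(1-x)^2, a 2-1 Neg with probability 3*x^2*(1-x), and a 3-0 Neg with probability x^3. Multiplying each probability by the number of rounds gives the ‘expected’ counts that the Chi-square test compares against the observed counts.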

I have gathered data to perform such an analysis: I have recorded the available elimination round data from octas-bid LD tournaments for the 2014-15 season via Tabroom.com, and coded them along three values: whether the Aff or the Neg won; whether the decision was unanimous; and whether the elim took place before quarterfinals, or from quarterfinals on.

In sum, this provided me with a dataset of 216 elimination rounds from the Harvard, Bronx Science, Glenbrooks, Valley, Harvard-Westlake, and St. Mark’s tournaments (with the exception of Bronx Science, I included data starting with the first full elim—so full trips for Harvard, but full doubles for Valley, despite the occurrence of some run-off rounds prior to that). Among these 216 rounds, the Affirmative won 98 of them for a win-rate of 45%. 121 of the 216 rounds were unanimous, for a unanimity-rate of 56%. The percentages of specific decision outcomes are listed in the table below:

[Elim graph 1: observed percentages of 3-0 Aff, 2-1 Aff, 2-1 Neg, and 3-0 Neg decisions]

 

To determine whether these outcomes are consistent with each judge randomizing on their individual ballot, one must construct an ‘expected’ set of observations according to the randomization process. A naïve randomization might suggest that each judge has a 50-50 chance of voting Aff vs. Neg on their individual ballot, and that therefore the distribution would look something like this:

 

[Elim graph 2: expected distribution of outcomes if each judge votes Aff or Neg with 50-50 probability]
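As a rough sketch of how those naive expected counts can be computed (assuming the 216-round dataset and the 50-50 per-ballot model described above), in Python:

```python
# Expected counts for 216 panels if each of three judges independently
# votes Aff or Neg with probability 0.5 (the naive model above).
n_rounds = 216
p_neg = 0.5

outcomes = {
    "3-0 Aff": (1 - p_neg) ** 3,              # all three judges vote Aff
    "2-1 Aff": 3 * p_neg * (1 - p_neg) ** 2,  # one Neg ballot, two Aff
    "2-1 Neg": 3 * p_neg ** 2 * (1 - p_neg),  # two Neg ballots, one Aff
    "3-0 Neg": p_neg ** 3,                    # all three judges vote Neg
}

for label, prob in outcomes.items():
    print(label, round(n_rounds * prob, 1))
# 3-0 Aff 27.0, 2-1 Aff 81.0, 2-1 Neg 81.0, 3-0 Neg 27.0
```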

In Lincoln-Douglas, however, there is a strong suspicion that Negatives win more often than 50% of the time, so there is good reason to think that this ‘random’ hit-rate for Neg winning should not be 50%. Indeed, this advantage for Negatives is present in the analysis’s data, so we must adjust the expected randomization rate accordingly (bear with me here, because this will involve a little math).

Using the aggregate data, we can calculate the Neg win-rate for these elimination rounds as 54.7%. To win a round as Neg, you could either win on a 3-0 decision or on a 2-1 decision (with three possible combinations of how this 2-1 could occur, depending on which judge sits). Accordingly, if judges were randomizing their ballots, we would need to find a ‘randomization rate’ for a Negative ballot such that x^3 + 3*x^2*(1-x) = 0.547, which produces an x value of 0.531. (This math is simpler than it seems: About 15% of the time with this randomization rate you would expect a 3-0 for the Negative, and just under 40% of the time it would produce a 2-1 for the Negative, so it combines to the total Neg win-rate of 54.7%.)
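For readers who want to check that number, here is a minimal sketch of the calculation in Python (using SciPy’s root-finder, though trial-and-error on a calculator works just as well):

```python
from scipy.optimize import brentq

# Probability that a three-judge panel returns a Neg decision (3-0 or 2-1)
# if each judge independently votes Neg with probability x.
def panel_neg_rate(x):
    return x**3 + 3 * x**2 * (1 - x)

# Find the per-ballot rate x that yields the observed 54.7% Neg win-rate.
x = brentq(lambda v: panel_neg_rate(v) - 0.547, 0.0, 1.0)
print(round(x, 3))  # -> 0.531
```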

So, we now know the rate at which each judge would have to be ‘randomly’ voting for Neg vs. Aff in order to produce an overall Neg win-rate of 54.7%, and using that information we can construct what the expected data for the entire set of outcomes would be:

 

[Elim graph 3: expected distribution of outcomes under the adjusted 53.1% per-ballot Neg randomization rate]

 

You can tell visually that these distributions are very different from one another. Whereas the expected distribution predicts far more 2-1 decisions than 3-0 decisions, in practice 3-0s are quite prevalent. A 3-0 decision for the Negative is the most likely of any individual outcome, and a 3-0 decision for the Affirmative is nearly as likely as a 2-1 decision in the Aff’s favor. According to these data and the accompanying Chi-square test, there is a next-to-zero (p < 0.0001) chance that this distribution could have arisen from judges randomizing in the ways described above; this is relatively strong evidence that judges are able to discern winners and that they generally agree with one another on this distinction, or at least agree more than mere chance would predict.
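For completeness, here is a minimal sketch of how such a test can be run in Python with SciPy. The observed counts below are placeholders only: they are chosen to be consistent with the published aggregates (98 Aff wins, 121 unanimous decisions) but are not the article’s actual per-category breakdown, which appears in the first graph above.

```python
from scipy.stats import chisquare

n_rounds = 216
x = 0.531  # per-ballot Neg randomization rate derived above

# Expected counts under the randomization model,
# ordered (3-0 Aff, 2-1 Aff, 2-1 Neg, 3-0 Neg).
expected = [
    n_rounds * (1 - x) ** 3,
    n_rounds * 3 * x * (1 - x) ** 2,
    n_rounds * 3 * x ** 2 * (1 - x),
    n_rounds * x ** 3,
]

# Placeholder observed counts; substitute the actual counts from the
# observed data above (same category order, summing to 216).
observed = [49, 49, 46, 72]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)
```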

Yet a plausible objection here is that the elimination round data might need to be further segmented. For instance, perhaps the data do not fit this randomization model because judges can easily distinguish between winners and losers in early elimination rounds, which typically contain more lopsided matchups, but in late elimination rounds the decision is much murkier. In fact, I find some support for this hypothesis, though it may be an artifact of a smaller sample size for this segment.

To evaluate this hypothesis, I replicated the above analysis, but pared down to the 36 coded rounds that took place in quarterfinals or later. In these rounds, the Neg side-bias was even more pronounced, with Neg winning 61% of elimination rounds, so the ‘expected’ randomization rate on ballots to achieve such an overall win-rate would be 57% for the Neg and 43% for the Aff. This creates the following expected distribution, compared to the actual observed distribution for these late elimination rounds:

 

[Elim graph 4: expected vs. observed distribution of outcomes for quarterfinals and later]

Once again, 3-0 decisions seem to be more prevalent than randomization would predict, but the Chi-square test does not come back statistically significant at a p<0.05 level (p=0.077). For this subset, therefore, judges’ decisions could still be modeled as individual random draws from a 43-57 probability distribution: we cannot statistically distinguish the observed outcomes from those random draws. According to this p-threshold, we cannot reject the null hypothesis that judges are essentially randomizing their individual decisions in late elimination rounds, though we are very close to doing so.
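The same pipeline sketched above applies here, just with the late-elim parameters (36 rounds and a 61% Neg win-rate); a minimal version in Python:

```python
from scipy.optimize import brentq

n_rounds = 36
neg_win_rate = 0.61

# Per-ballot Neg rate that would reproduce a 61% panel-level Neg win-rate.
x = brentq(lambda v: v**3 + 3 * v**2 * (1 - v) - neg_win_rate, 0.0, 1.0)

# Expected counts (3-0 Aff, 2-1 Aff, 2-1 Neg, 3-0 Neg) under randomization,
# to be compared against the observed late-elim counts with chisquare().
expected = [
    n_rounds * (1 - x) ** 3,
    n_rounds * 3 * x * (1 - x) ** 2,
    n_rounds * 3 * x ** 2 * (1 - x),
    n_rounds * x ** 3,
]
print(round(x, 2), [round(e, 1) for e in expected])  # x is roughly 0.57
```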

As noted before, this is certainly not an ironclad finding: The observed p-value of 0.077 is very close to the typical significance threshold of 0.05, and there is certainly reason to think that a larger dataset could reduce uncertainty/variance and thus achieve statistical significance. Additionally, there are other plausible explanations, such as debaters sometimes targeting 2 judges on a panel, or lopsided prefs such that a debater is more likely to only win 2 judges; more complete data would help evaluate these hypotheses. But for the time being, the data are inconclusive about judges’ ability to consistently discern winners in late elims, and that might give the community some pause.

Overall, the national circuit LD community should take relative solace in these findings: Even though judges often gripe about the difficulty of resolving rounds, and it would be easy to fall into believing that most decisions are mere rationalizations, there seems to be something about these decisions that persists across judges, causing 3-0s to be overrepresented. We are likely justified in believing that decisions are well-merited. But even if we were not, debaters could (and should) still continue working on skills that distinguish between them and their competitors—clarity, evidence-comparison, crystallization. Meanwhile, LD might consider structural reforms (say, lengthening speeches) to allow debaters to distinguish themselves even further and increase the chance that debaters’ skills can manifest themselves in round—but again, the situation is not critical. For the first time, there is evidence that judges in LD are not merely randomizing in their decisions about for whom to vote in elimination rounds, and this should give us some confidence that the activity has not gone entirely off the rails.

 

  • Shahbaz Shaikh

    You should try instituting strata in your outround participants with seeds, prelim success, flip win rates and ways of anticipating prep outs (affs broken and hidden), but good luck finding this data lol. This would help you figure out randomness – there’s only random guessing if the judges do not really know the “answer” and must decide themselves for whatever side. Your application of the p-value may be missing whether it is one or two sided (as far as I can tell), and having win rate by contestant and win rate by judge predisposition is a more astute way of saying whether someone was capable of winning and whether the judging reflected that. These are also better indicators of a round, each aff/neg node is certainly much different, and simulation is far from the finesse of a good Nature/Information diagram.

  • Greg Malis

    A methodology question… If the null hypothesis is that judges are essentially guessing randomly (i.e., unable to discern who is actually winning), then why correct for the side bias? The basis of the bias is that the structure of an LD round makes it more likely for the neg to win. If judges were just guessing, then there wouldn’t be a side bias at all. Unless the well-documented bias affects the guess, assuming that the null is true? Regardless, I think it’s a good statistical analysis. I might use this in my Stats class, as I think it’s a clever use of a Chi-Square. Technically, I think it’s the test for goodness of fit, not independence, but it still is very useful and informative.

  • Michael O’Krent

    Ha! Something I’ve been dying to know since novice year. Great work. This is a very clever application of the Chi-square test.
    I have a methodological suggestion for future analyses of this sort. If you want to quantify side bias as part of the expected distribution when judges are guessing, the aff-neg probability distribution should come from something other than the results of elimination rounds (i.e., the data set you’re testing). The reason why is that the neg ballot-rate (54.7%, I believe) occurs as a result of the judges guessing or not guessing. If judges are just guessing, we don’t know what the “actual” side bias should look like (e.g., maybe if judges knew what they were doing, 48% of ballots would be neg, or any other number). We’re unable to isolate side bias as a variable from your data unless we already know that judges aren’t guessing (in which case coin flips should even out debaters’ skills over a large sample size). In logical terms, then, the second part of this analysis begs the question by assuming that judges aren’t guessing. It does work well as a test of consistency by showing that our belief that judges aren’t guessing is internally consistent with actual outround results, but it can’t falsify the hypothesis without an external way to quantify what the aff-neg ballot percentages would look like if judges were actually just guessing.
    I have no idea where such a number would come from, because of course the only way to quantify side bias is through empirical data. Like everything that turns out to be important, side bias does not show up quantitatively in theory. One possibility would be using prelim data to get that distribution; however, side bias probably isn’t the same between prelims and outrounds. It’s probably greater in outrounds.
    Instead, I think the first part of the analysis, which used 50-50 as the expected distribution, is pretty close to spot-on. If side bias shows up in the empirical data (which it does), that’s probably because negs are, in fact, more likely to win outrounds (also largely considered true), and judges correctly decide that more negs are winning outrounds. The very existence of side bias can actually be taken as evidence that judges aren’t guessing.
    Really the main thing (with respect to side bias) the 50-50 distribution leaves out is the possibility that the perception of side bias subconsciously influences judges’ decisions. This would mean that because everyone already thinks that negs are more likely to win outrounds, judges become more likely to vote for negs to justify that belief or something psychological like that. My intuition suggests that this effect would be pretty small. But of course, that intuition could be wrong. This is what needs testing: 1. Whether the perception of side bias influences judges, and 2. Whether judges are just guessing in outrounds while being influenced by the perception of side bias. And naturally, these hypotheses suffer from the same problem of being unable to construct an expected distribution – I have no idea how to even begin to quantify the subconscious effects of perceived side bias. Maybe there’s an approach that does not require a preexisting theoretical model? I can’t think of one, but I’m also a literature major so who knows.

    • Steven Adler

      What’s up Michael, thanks for the thoughtful feedback.

      You’re absolutely right that there’s some problem of “endogeneity” with the Neg-guess rate coming from this same dataset. One idea that I had for solving that problem was using Neg win-rates from Henson & Dorasil (2011), but decided that those were too out-of-date, since they covered years ’04-’09: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1768087

      I also think your hunch about a difference between prelims and elims is on-point. If I recall correctly, previous studies of side bias have found that it’s fairly negligible in prelims (because the skill disparity is often much greater), but takes over in elims (perhaps because high-level negating is much easier than high-level affirming; or perhaps because of built-in biases we now have about thinking negs win elims; or perhaps for some other undiscussed reason).

      Anyways, I think there’s good anecdotal reason to believe that perception of side bias does influence round outcomes: thresholds for Aff extensions; lenience in presumption arguments; decisions to grant RVIs; and many other sometimes-decisive judge calls do seem to be influenced, at least some of the time, by the judge’s prior beliefs about the fairness of Aff/Neg splits.

      Given those above points, in conjunction with the fairly uncontroversial claim that Neg does win more often, I do think that speaks to a need to represent guessing in some form other than 50/50. Specifying that form is a much tougher question, but I do think it should be non-50/50.

  • Emily Massey

    Neat! Thanks for gathering and analyzing the data, Steven.

  • Samir Reddy

    This is awesome. Prayer hand emojis to Steven.