A Statistical Analysis Of Side-Bias On The 2019 September-October Lincoln Douglas Debate Topic by Sachin Shah

The 2019 September-October resolution provides another opportunity to test whether the pattern of a negative side-bias observed over the course of last year continues to hold.

2019 September-October Data Set

Affirmative and negative ballots were gathered from tabroom.com for 8 Varsity Lincoln Douglas Tournament of Champions bid-distributing tournaments on the 2019 September-October topic across the country: Grapevine, Loyola, University of Kentucky, Greenhill, Yale, Valley, Long Beach, and Holy Cross. These tournaments range from finals-bid to octofinals-bid qualifiers. The data set contains 2,703 ballots, representing fairly diverse debating and judging styles.

One-Proportion z-test

When all posted rounds on the 2019 September-October topic are analyzed, the negative won 50.77% of rounds. The question is whether the difference between 50.77% and the expected value (50%) is statistically significant or simply due to chance. To calculate a p-value and answer this question, a one-proportion z-test was used. The null hypothesis was set to p = 0.5 (where p is the proportion of negative wins), since, barring any bias, the affirmative and the negative are expected to win the same number of times. The alternative hypothesis was p > 0.5. The alpha level was set at 0.1. The z-test fails to reject the null hypothesis (p-value = 0.23, 99.99% confidence interval [46.7%, 54.8%]). This means that if rounds were truly unbiased, there would be a 23% chance of observing a proportion of negative wins at least this large. There is insufficient evidence to conclude there is a negative side-bias at a 90% confidence level.
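For readers who want to reproduce the test, the sketch below runs a one-proportion z-test of this form in Python. The negative-win count is reconstructed from the reported 50.77% of 2,703 ballots (roughly 1,372 ballots), so the output may differ slightly from the reported p-value.

```python
from math import sqrt
from scipy.stats import norm

def one_proportion_ztest(successes, n, p0=0.5):
    """One-sided one-proportion z-test of H0: p = p0 against Ha: p > p0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)      # standard error under the null hypothesis
    z = (p_hat - p0) / se
    return p_hat, z, norm.sf(z)       # sf(z) is the upper-tail p-value

# ~50.77% of 2,703 ballots is roughly 1,372 negative wins (reconstructed, not exact counts).
p_hat, z, p_value = one_proportion_ztest(1372, 2703)
print(f"neg win rate = {p_hat:.4f}, z = {z:.2f}, p-value = {p_value:.2f}")
```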

The graph illustrates that a negative side bias is not pervasive across tournaments, with variance from an unbiased distribution ranging from -8% to almost 6%. The main outlier is the University of Kentucky tournament, a finals bid, where the negative won 42% of all rounds; however, this percentage should not be extrapolated further due to the small sample size, and it is unlikely that this single tournament replicates national circuit debate’s diverse judging and debating styles. It is interesting that the two California tournaments had a variance over 4%. Most tournaments are within 1% of an unbiased distribution, demonstrating the lack of a side-bias.

A More Robust Model

We can further characterize the side bias by taking into account the difference in skill between the two debaters. The previous analysis assumes that each debater should have an equal chance of winning; the following analysis develops a more robust model that estimates the probability that each debater wins based on their respective skill levels, since rounds in which the affirmative debater is stronger are more likely to result in affirmative wins than negative wins. A commonly proposed method to address this concern is to limit the sample data to elimination rounds, or to octofinal and quarterfinal bid tournaments only. In the 570 elimination ballots (double-octofinals through finals), the negative won 52.11% of ballots (p-value > 0.15). In the 5 octofinal and quarterfinal bid tournaments in the data set, the negative won 50.69% of rounds (p-value > 0.25). Both of these subsets still fail to show a negative side-bias. However, this approach sets the cutoff arbitrarily and still includes debater skill disparities; for example, the elimination-round model would still use results from high 6-0s debating low 4-2s. Additionally, because there have only been a few octofinal and quarterfinal bid tournaments on this topic so far, that sample size is too small for useful conclusions.

For a more robust account of debater skill differences, this study implemented an Elo rating system, which rewards debaters more for defeating strong debaters than for defeating weak ones. Each debater starts with a rating of 1500; as they win or lose rounds, their rating changes depending on the round’s difficulty. For example, if a 1500-rated debater loses to a 2000-rated debater, their rating would drop 1.597 points, while if they won, their rating would rise 28.403 points. Each debater’s Elo evolves over the rounds they debate. For the purposes of calculating Elo ratings for every debater, rounds were gathered from 106 TOC bid-distributing tournaments from 2017-2019 (YTD) with round results posted on tabroom.com.
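A minimal sketch of a standard Elo update is shown below. The article does not state its implementation details; a K-factor of 30 is assumed here because it reproduces the -1.597 and +28.403 point swings described above.

```python
def expected_score(rating, opponent_rating):
    """Win probability implied by the Elo model."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))

def elo_update(rating, opponent_rating, won, k=30):
    """Return the debater's new rating after one round (won=1 for a win, won=0 for a loss)."""
    return rating + k * (won - expected_score(rating, opponent_rating))

# A 1500-rated debater hitting a 2000-rated debater:
print(elo_update(1500, 2000, won=0))  # ~1498.4 (drops ~1.6 points)
print(elo_update(1500, 2000, won=1))  # ~1528.4 (gains ~28.4 points)
```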

One way to quantify the side-bias is to examine only the rounds where the debater with the lower Elo rating won, indicating an upset occurred. Theoretically, upsets should be equally distributed between affirmative upset wins and negative upset wins. In the 714 upset rounds across tournaments in the 2019 September-October data set, the negative won 52.38% of rounds (p-value > 0.1, 99.99% confidence interval [45.1%, 59.7%]). This test demonstrates there is insufficient evidence of a difference between the number of times the negative overcomes the skill disparity when the affirmative is slated to win and the number of times the affirmative overcomes the disparity when the negative is slated to win. Therefore, there is insufficient evidence of a negative side-bias, because neither side overcomes the debater skill disparity significantly more often than the other.
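As an illustration of how upset rounds can be identified, the sketch below filters rounds where the winner’s pre-round Elo was lower than the loser’s. The round records and field names are hypothetical; the resulting negative share of upset wins would then feed the same one-proportion z-test used earlier.

```python
def is_upset(round_):
    """True when the debater with the lower pre-round Elo won the round."""
    winner_elo = round_["aff_elo"] if round_["winner"] == "aff" else round_["neg_elo"]
    loser_elo = round_["neg_elo"] if round_["winner"] == "aff" else round_["aff_elo"]
    return winner_elo < loser_elo

# Hypothetical ballots with each side's pre-round rating and the winning side.
rounds = [
    {"aff_elo": 1620, "neg_elo": 1540, "winner": "neg"},  # negative upset win
    {"aff_elo": 1480, "neg_elo": 1710, "winner": "aff"},  # affirmative upset win
    {"aff_elo": 1700, "neg_elo": 1500, "winner": "aff"},  # not an upset
]
upsets = [r for r in rounds if is_upset(r)]
neg_upset_share = sum(r["winner"] == "neg" for r in upsets) / len(upsets)
print(f"negative share of upset wins: {neg_upset_share:.2f}")  # 0.50 for this toy data
```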

To further quantify the lack of a side-bias, the proportion of negative wins when the affirmative was favored (p1) can be compared with the proportion of affirmative wins when the negative was favored (p2). Ideally, the difference between the proportions would be 0; however, p1 = 32.92% while p2 = 30.74%, a 2.18% difference. To determine whether this difference is statistically significant, a two-proportion z-test was used. The null hypothesis is p1 − p2 = 0, meaning both sides overcome the debating-level skew equally often. The alternative hypothesis is p1 − p2 > 0, meaning the negative overcomes the skew more often than the affirmative, demonstrating a side-bias. This two-proportion z-test failed to reject the null hypothesis (p-value > 0.1, 99.99% confidence interval [-9.9%, 5.5%]). There is insufficient evidence that the negative is able to overcome the skew produced by debating-level differences more often than the affirmative.
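The sketch below shows a pooled two-proportion z-test of this form. Since the article reports only the two percentages and not the underlying group sizes, the counts used here are placeholders for illustration.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """One-sided pooled two-proportion z-test of H0: p1 - p2 = 0 against Ha: p1 - p2 > 0."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                        # pooled proportion under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, norm.sf(z)

# Placeholder group sizes: the article reports only p1 = 32.92% and p2 = 30.74%,
# not the number of rounds in each group, so these counts are purely illustrative.
z, p_value = two_proportion_ztest(x1=329, n1=1000, x2=307, n2=1000)
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```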

This analysis is statistically rigorous and relevant in several respects: (A) The data is on the current 2019 September-October topic, meaning it is directly relevant to rounds debated in these months [1]. (B) The data represents a diversity of debating and judging styles across the country. (C) The analysis accounts for disparities in debating skill levels in both prelims and eliminations. (D) Multiple tests confirm the results.

Cost Functions:

One concern with the z-tests performed lies with the alpha level. Choosing the alpha is functionally arbitrary; there are few reasons to prefer 0.1 over 0.05. There are two types of error: Type I, which occurs when the null hypothesis is rejected when it should not be, and Type II, which occurs when the null hypothesis is not rejected when it should be. The probabilities of a Type I and a Type II error occurring are alpha and beta, respectively. Assuming each type of error is equally bad, minimizing alpha + beta would lead to the best result. The error function looks like:

f(a, b) = Ca + Db

Where a and b are alpha and beta respectively, and C and D are weight constants. C should be greater than D if assuming there is bias when there is none is worse than assuming there is no bias when there is some. Minimizing f(a, b) would lead to the best choice of alpha level. To figure out whether a Type I or Type II error is worse, it helps to understand the consequences of a side-bias. If there is no side-bias, then a regular debate should occur. If there is a side-bias, then there should be some form of compensation; for example, a negative side-bias would mean affirming is harder and could perhaps justify the affirmative debater choosing the framework for the round. The form of compensation should likely stay within the rules of debate, i.e., the affirmative could not give a 10-minute 2AR, but the affirmative might be justified in reading “unfair” arguments. In either case, without arbitrarily prioritizing one error over the other, choosing anything over 0.1 would be out of the ordinary [2]. Additionally, multiple tests all confirm there is insufficient evidence of a negative side-bias, increasing the validity of the conclusion.
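To make the trade-off concrete, the sketch below computes beta for a range of alpha levels and picks the alpha that minimizes C·a + D·b. The true negative win rate under the alternative (p_true = 0.53) and the weights C and D are illustrative assumptions, not values taken from the article.

```python
from math import sqrt
from scipy.stats import norm

def beta_for_alpha(alpha, n, p0=0.5, p_true=0.53):
    """Type II error rate of the one-sided test when the true negative win rate is p_true."""
    crit = p0 + norm.ppf(1 - alpha) * sqrt(p0 * (1 - p0) / n)   # rejection threshold for p-hat
    se_true = sqrt(p_true * (1 - p_true) / n)
    return norm.cdf((crit - p_true) / se_true)                   # P(fail to reject | p = p_true)

def total_cost(alpha, n, C=2.0, D=1.0):
    """Weighted error cost f(a, b) = C*a + D*b."""
    return C * alpha + D * beta_for_alpha(alpha, n)

# Grid search for the alpha that minimizes the weighted cost; C, D, and p_true
# are assumptions chosen only to illustrate the shape of the trade-off.
alphas = [i / 1000 for i in range(1, 300)]
best = min(alphas, key=lambda a: total_cost(a, n=2703))
print(f"cost-minimizing alpha: {best:.3f}")
```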

Implications for Argument Justifications:

Obviously, side-bias-justified arguments would no longer make sense on this topic; however, I would argue that other arguments that rely on a systemic skew would also fail to be justified. “Aff flex” and “neg flex” are usually justified by appeals to the nature of debate: for example, the strategy skew from the negative’s ability to select an advocacy based on the affirmative is a common justification for “aff flex,” while the affirmative getting the first and last speech justifies “neg flex.” This style of argumentation points to a fundamental difference between the affirmative and negative that may provide one side with the upper hand. Although that could be the case on other topics, the fact that this topic does not have a significant negative side-bias nullifies these arguments. A topic with no side-bias means that even if there are structural features that give one side an advantage, the topic is constructed such that they even out. In simple terms, when there is no empirical side-bias, there is no structural issue that favors one side over the other, because if there were, it would show up in the side-bias analysis. Debaters will therefore have to innovate new arguments to justify “aff flex” and “neg flex,” because the standard “7-4-3-6 time skew” will likely not justify the argument on this topic.

Potential Explanations:

This is the first topic on which a study has failed to demonstrate a negative side-bias. Previous studies found a negative side-bias to some extent in various data sets from 2011 – 2015 and 2018 – 2019 [3]. Even the larger data set used to calculate Elo ratings above (106 TOC bid-distributing tournaments from 2017 – 2019 YTD) shows a strong negative side-bias, with the negative winning 52.60% of ballots (p-value < 0.0001, 99.99% confidence interval [51.8%, 53.8%]) and 54.45% of upset rounds (p-value < 0.0001, 99.99% confidence interval [52.5%, 56.4%]). The question then becomes what makes this topic different from the others. I think debaters should choose a possible explanation (whether from the list below or not) and justify why their model makes the most sense. There are many plausible explanations, and the answer is likely a combination of several, so having these discussions in and out of round will help us understand the nature of debate.

(1) If the negative side-bias has historically been caused by topical advantages, then one possible explanation is that this resolution simply has better ground on both sides than past topics. Better ground could suggest more nuanced clash and thus more resolvable rounds.

(2) Assuming the negative derived an advantage from topical theory arguments, this topic may simply lack the need for theory. One common category of topical theory arguments is spec shells. However, there are only so many standardized tests that undergraduate admissions consider in the first place, the most common probably being the SAT and ACT. Colleges and universities might be parameterized; however, it seems any argument for why one college should not consider standardized tests would apply to all other colleges. If the affirmative does not defend a small plan, then the negative loses most spec shells, and if the spec shells are not convincing, then the negative loses access to a large section of topical theory arguments.

(3) Perhaps presumption and permissibility both flowed negative in the past, giving the negative the upper hand in close rounds. The resolution’s use of “ought not” might tip the permissibility ground towards the affirmative, rectifying a former skew.

(4) It could be the case that the 7-minute 1NC and uplayering the debate away from the affirmative’s issues skew rounds towards the negative. If so, maybe the negative has fewer uplayering options on this topic, or affirmatives are better built to deal with the uplayering.

Topical Discussion:

In order to be more informed about the types of arguments debaters are reading on this topic, this study implemented a web crawler to parse every debater’s NDCA LD wiki page. A list of all affirmative case titles containing a phrase similar to “septoct,” as disclosed in the Cites Box, was compiled. The list was modified so that each entry contained only letters and no extraneous phrases such as AC, aff, and v1. Each unique case name was then counted. Of the 602 case entries meeting these criteria, the top 3 most common affirmatives were Diversity (40), Stock (37), and Deleuze (18). In the top 30%, the only plan was the SAT and ACT plan (based on case titles). It is interesting that there are so few plans in this data set.
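A minimal sketch of the normalization-and-counting step is shown below. The stripped tokens and sample titles are assumptions for illustration, and the wiki-crawling step itself is omitted.

```python
import re
from collections import Counter

# Extraneous tokens to drop from disclosed titles (assumed; adjust to the wiki's conventions).
STRIP_TOKENS = {"ac", "aff", "v", "v1", "septoct", "sept", "oct"}

def normalize_title(title):
    """Keep only letters, drop extraneous tokens, and collapse the rest into one case name."""
    words = re.sub(r"[^a-z\s]", " ", title.lower()).split()
    return " ".join(w for w in words if w not in STRIP_TOKENS)

# Hypothetical disclosed case titles; the real list comes from crawling each wiki page.
titles = ["Diversity AC SeptOct v1", "Diversity aff", "Deleuze AC - SeptOct", "Stock AC v1"]
counts = Counter(normalize_title(t) for t in titles)
print(counts.most_common(3))  # [('diversity', 2), ('deleuze', 1), ('stock', 1)]
```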

A similar process was used to compile a list of negative case names. Position categories (T, CP, DA, NC, K, and PIC) were sorted out of the name and appended to the end to account for naming-scheme differences, as shown in the sketch below. The top 3 most common negatives in the data set were Nebel T (49), Test Optional CP (39), and Grade Inflation DA (23). In the top 50 unique positions, theory accounts for 6 while disadvantages account for 19. This seems to support explanation number 2, because theory is read at a lower frequency than DAs and Ks. That being said, Kant and Particularism were the only traditional philosophy NC positions in the top 50. If permissibility and presumption are only read with philosophy positions, then explanation number 3 might have some merit, because the negative does not read philosophy NCs very often. Over 50% of all cases were some form of consequentialist position, whether a disadvantage or a counterplan, potentially supporting explanation number 4 because disadvantages and counterplans typically clash well with affirmative substance. Upon a manual examination of the case list, it appeared that many of the Ks were generic rather than topic-specific. Additionally, there were only 7% more unique negative positions than unique affirmative positions, which could support explanation number 4 because it means there is closer to a 1:1 ratio between affirmative and negative positions.
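The category handling might look like the short sketch below, where position tags are pulled out of the name and appended so that differently ordered titles count as the same position; the sample titles are hypothetical.

```python
CATEGORIES = {"t", "cp", "da", "nc", "k", "pic"}   # position tags named in the article

def normalize_neg_title(title):
    """Pull the position category out of the name and append it, so
    'CP Test Optional' and 'Test Optional CP' count as the same position."""
    words = [w for w in title.lower().split() if w.isalpha()]
    tags = [w for w in words if w in CATEGORIES]
    rest = [w for w in words if w not in CATEGORIES]
    return " ".join(rest + sorted(tags))

print(normalize_neg_title("CP Test Optional"))      # 'test optional cp'
print(normalize_neg_title("Test Optional CP v2"))   # 'test optional cp'
```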

There are a few problems with this data collection method. For example, “Korsgaard AC” and “Kant AC” would count as 2 different case names even though they are likely the same position, meaning the frequency counts could be off. In addition, this method assumes debaters adhere to a standard name across the circuit for similar cases. Potential solutions would likely involve some form of manual inspection to account for different naming conventions.

The 2019 September-October topic is one for the history books in many respects: affirming is not harder, there are few plans, the negative reads generics just as much as topical arguments, and more. As the topic unfolds over the remaining tournaments, we will see how the clash evolves with new positions and a greater understanding of the topic.

Footnotes

[1] It is important to note that the numbers presented in this article that use the 2019 September-October data set should only be used within the context of the 2019 September-October topic; debaters who attempt to extrapolate that data to future topics would be misrepresenting the intent of this article. The data set that utilizes 2017 – 2019 tournaments could be extrapolated to future topics, as it suggests a trend.

[2] Lavrakas, P. J. (2008). Encyclopedia of Survey Research Methods. Thousand Oaks, CA: Sage Publications, Inc. doi: 10.4135/9781412963947

[3] Some previous side-bias studies:

Anderson, Jim. “A Closer Look at the LD Time Skew.” January 19, 2014. http://decorabilia.blogspot.com/2014/01/a-closer-look-at-ld-time-skew_19.html

Adler, Steven. “Are Judges Just Guessing? A Statistical Analysis of LD Elimination Round Panels.” NSD Update March 30, 2015. http://nsdupdate.com/2015/03/30/are-judges-just-guessing-a-statistical-analysis-of-ld-elimination-round-panels-by-steven-adler/

Shah, Sachin. “A Statistical Analysis of Side-Bias on the 2018 September-October Lincoln-Douglas Debate Topic by Sachin Shah.” NSD Update. October 11, 2018. http://nsdupdate.com/2018/a-statistical-analysis-of-side-bias-on-the-2018-september-october-lincoln-douglas-debate-topic-by-sachin-shah/.

Shah, Sachin. “A Statistical Analysis Of Side-Bias On The 2018 November-December Lincoln-Douglas Debate Topic.” NSD Update. November 16, 2018.

Shah, Sachin. “A Statistical Analysis Of Side-Bias On The 2019 January-February Lincoln-Douglas Debate Topic.” NSD Update. February 16, 2019. http://nsdupdate.com/2019/a-statistical-analysis-of-side-bias-on-the-2019-january-february-lincoln-douglas-debate-topic/.
