Difficulty

From QBWiki
Latest revision as of 18:05, 22 November 2023

Difficulty can refer to either or both of the following:

  1. How hard the questions at the tournament were for the players to answer, as measured either subjectively by the players themselves or objectively through conversion statistics.
  2. How hard the writers or editors of the tournament expect the questions to be, by analogy to a previously-played tournament or general standard. This is often denoted target difficulty.

Theory

Difficulty is a concept that can be applied to an entire tournament, a specific packet, one question, or even a single clue. The ideal standard for difficulty would be an objective assessment of how well-known a fact (or collection of facts) is among the quizbowl community or the wider populace. This is, of course, impossible - determination of difficulty is thus a combination of various imperfect measures.

Subjective

Personal perception of difficulty is the most immediate and straightforward way to determine how hard something is. However, it is inevitably biased by the idiosyncrasies of one's knowledge base - it is possible (and indeed, fairly common) for someone to only know the leadin of a tossup or the hard part of a bonus. The pitfalls of this approach may be succinctly summarized in the aphorism "If I know it, it's too easy; if I don't, it's too hard" (sometimes called the fundamental difficulty error).

Combining multiple opinions can reduce these fluctuations and produce a consensus on difficulty; any remaining biases can then be viewed as broad community tendencies, though these are rarely obvious. Creating this sort of aggregation is one function of post-tournament discussion, along with identifying errata and talking about other aspects of the set (answer choice, distribution/subdistribution, writing philosophy, logistics, etc.).

Objective

There is a pervasive notion that the difficulty of a set (or a question, or a clue) may be distilled into a single numerical value which describes how hard it was on an "absolute" scale. Statistics like PPB, power rate, and BPA are examples of objective quantifiers that are often employed to serve this purpose - it is more accurate to say that these describe how well the field did. Nevertheless, hard data are very useful for talking about and comparing difficulties.

Assertions about how hard something was are tacitly assumed to be based on this sort of evidence, as relying purely on one's personal perception requires an undue amount of generalization. Such statements may be phrased as if it were possible to determine something's absolute difficulty ("this question is too hard...") but any potential confusion from this grammatical convention can be alleviated by making oneself clear ("...because it wasn't powered at any site").
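To make the point about field dependence concrete, here is a minimal Python sketch that computes power rate, conversion rate, and PPB for one set from tossup results. The `TossupResult` record, its field names, and both example fields are hypothetical; it also assumes a bonus is heard only after its tossup is converted, which ignores bouncebacks and other formats.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TossupResult:
    """One room's result on one tossup (hypothetical record format)."""

    powered: bool      # answered correctly within the power portion
    converted: bool    # answered correctly at any point
    bonus_points: int  # 0-30 on the bonus that followed; ignored if not converted


def field_stats(results: list[TossupResult]) -> dict[str, float]:
    """Aggregate power rate, conversion rate, and PPB over every tossup the field heard."""
    if not results:
        raise ValueError("no tossups heard")
    heard = len(results)
    bonuses_heard = sum(r.converted for r in results)
    bonus_total = sum(r.bonus_points for r in results if r.converted)
    return {
        "power_rate": sum(r.powered for r in results) / heard,
        "conversion_rate": bonuses_heard / heard,
        "ppb": bonus_total / bonuses_heard if bonuses_heard else 0.0,
    }


# The same four tossups, heard by a strong field and a weaker field (made-up results).
strong_field = [
    TossupResult(True, True, 30), TossupResult(True, True, 20),
    TossupResult(False, True, 20), TossupResult(False, True, 10),
]
weak_field = [
    TossupResult(False, True, 10), TossupResult(False, False, 0),
    TossupResult(False, True, 20), TossupResult(False, False, 0),
]
print(field_stats(strong_field))  # power_rate 0.5, conversion_rate 1.0, ppb 20.0
print(field_stats(weak_field))    # power_rate 0.0, conversion_rate 0.5, ppb 15.0
```

Both fields heard identical questions, yet every statistic differs: the numbers describe how the field did, not a property of the questions alone.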

Absolute

There are various metrics which are sometimes used to approximate an absolute measurement of difficulty. These are largely unorthodox, as it is accepted that conventional methods like measuring PPB are very field-dependent:

  • Wikipedia page views
  • various Google statistics (Trends, Ngram Viewer, search results)

While it is true that these are free of many of the assumptions that plague other statistics, this cuts both ways: measures like these are not particularly useful because they are so far removed from the actual experience of playing the game.
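For those who want to look at one of these measures anyway, Wikipedia page views can be pulled from the public Wikimedia REST pageviews API. The sketch below builds a per-article request URL, fetches it, and totals the `views` entries of the response. The helper names, the monthly granularity, the `user` agent filter (which excludes traffic identified as bots), and the placeholder User-Agent string are illustrative choices, not part of any established quizbowl method.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article: str, start: str, end: str, project: str = "en.wikipedia") -> str:
    """Build a monthly per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Wikipedia titles use underscores for spaces; percent-encode everything else.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_BASE}/{project}/all-access/user/{title}/monthly/{start}/{end}"


def fetch_pageviews(url: str) -> dict:
    """Fetch and decode the JSON response (this performs a network request)."""
    # Wikimedia asks API clients to identify themselves; this contact string is a placeholder.
    req = Request(url, headers={"User-Agent": "difficulty-example/0.1 (contact: you@example.org)"})
    with urlopen(req) as resp:
        return json.load(resp)


def total_views(payload: dict) -> int:
    """Sum the view counts across all returned time buckets."""
    return sum(item["views"] for item in payload.get("items", []))
```

Usage would look like `total_views(fetch_pageviews(pageviews_url("Mali Empire", "20230101", "20231231")))`. The resulting count reflects the general public's browsing, which is exactly why it says little about how a clue plays in a match.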

Relative

However desirable an absolute measure would be, it is generally only possible to determine the relative difficulty of a clue/question/set, for several reasons:

  • The composition of a field can have considerable impacts on stats - for instance, the absence of strong players in a category can depress power numbers, making the subject appear more difficult.
  • The difficulty of individual clues can be skewed considerably by appearing in other sets frequently, recently, or both. This is a major factor in why some questions play significantly easier after time has passed: inclusion of a piece of information into the canon makes it significantly easier for players who pay attention to it.
  • Without some sort of comprehensive survey of all players, any single value will be incomplete in describing how difficult the community as a whole finds something.
  • Even a perfect description of something's difficulty within the game will not be able to describe how hard it is in broader society. It is known (and frequently commented on) that the demographics of quiz bowlers are substantially different from the general population. Despite this, metrics like difficulty and importance often reference how well the average person (from the street or in a field) would know it.

One can typically figure out the direction in which stats have been skewed by these factors, but not the precise magnitude. These factors are often small enough (and the bins of "difficulties" are large enough) that most observers will broadly agree - for instance, even though some collegiate two-dot sets are harder than others, they are almost always closer to one another than they are to three-dot sets.
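The direction-versus-magnitude point can be illustrated with a toy model. The sketch below assumes, purely for illustration, that a team's chance of answering a question is a logistic function of team strength minus question difficulty, with an arbitrary `scale`; all strengths and difficulties are made-up numbers. The same question converts more often in a stronger field, but the size of that gap depends on `scale`, which real data would not reveal. Within any single field, the harder question still converts less often, which is why relative comparisons hold up better than absolute ones.

```python
from __future__ import annotations

import math


def p_correct(strength: float, difficulty: float, scale: float = 1.0) -> float:
    """Toy assumption: chance of a correct answer is logistic in (strength - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(strength - difficulty) / scale))


def conversion_rate(field: list[float], difficulty: float, scale: float = 1.0) -> float:
    """Expected fraction of teams in `field` that answer a question of this difficulty."""
    return sum(p_correct(s, difficulty, scale) for s in field) / len(field)


# Hypothetical team strengths for two fields hearing the same questions.
strong_field = [2.0, 2.0, 1.0, 0.0]
typical_field = [2.0, 1.0, 0.0, 0.0, -1.0, -1.0]

easier_q, harder_q = 0.5, 1.5  # hypothetical "true" difficulties

# The direction of the skew is fixed; its magnitude depends on the assumed scale.
for scale in (0.5, 1.0, 2.0):
    gap = conversion_rate(strong_field, harder_q, scale) - conversion_rate(typical_field, harder_q, scale)
    print(f"scale={scale}: conversion-rate gap on the harder question {gap:+.2f}")

# Relative ordering is preserved within each field.
for name, field in (("strong", strong_field), ("typical", typical_field)):
    print(name, conversion_rate(field, easier_q) > conversion_rate(field, harder_q))  # True for both
```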

Regular difficulty

Main page: Regular difficulty

Regular difficulty is the normative difficulty for questions at a given level of quizbowl. Theoretically, it represents the difficulty level at which any eligible closed team across the whole range of skill levels can play meaningful games against any other eligible team. For example, a regular-difficulty high school set should have a distribution, selection of clues/answers, etc. that allows the more knowledgeable high school team in a given match to consistently win,[1] regardless of whether it's a match between weak teams, average teams, or strong teams.

In practice, regular difficulty sets may not align with the optimal difficulty for the population of active teams, especially among the subset that are nationally competitive. This can skew either way: in high school, the regular difficulty (as set by IS sets) is often considered to be "too easy", while in college regular difficulty (currently still set by ACF Regionals) it is "too hard".

College Level

See also: Collegiate difficulties

At the college and open levels of quizbowl, the four main general standards of difficulty (in increasing order of difficulty) are: novice, regular, nationals, and post-nationals. The first three levels roughly (but not exactly) correspond to the difficulty level of previous ACF Fall, ACF Regionals, and ACF Nationals sets, respectively; the fourth is reserved for anything harder than ACF Nationals.

There have been efforts to reframe "regular difficulty" as something easier than ACF Regionals, which would instead be described as "Regionals difficulty". ACF Winter, the ACF tournament intermediate in difficulty between Fall and Regionals, lies in this range and returned in 2020 after a ten-year hiatus.

Ophir Lifshitz has created a four-dot difficulty scale to remove ambiguities in difficulty terminology.

High School Level

At the high school level, HSAPQ tournament sets and NAQT IS sets are considered the standard for regular difficulty. Most other sets are described in terms of how much easier or harder than these sets a tournament is expected to be.

Gradations

With few exceptions, high school quiz bowl tournaments span a range of difficulty roughly equivalent to the gap between NAQT Collegiate Novice and ACF Winter (i.e. one-third of the standard collegiate calendar, or between one dot and two dots on the collegiate calendar difficulty scale). Despite this, high school quiz bowl has at least as many difficulty gradations as college. For instance, the difference between "nationals-minus" and "regs-plus" sets is often considered to have significant gameplay consequences for top high school teams; by contrast, collegiate "nats-minus" has more or less absorbed "regs-plus" (though some would argue that the two levels have effectively merged in high school as well).

There are various reasons for this; several of them are outlined below:

  • High school quiz bowl is a distinct entity from the college game. The majority of players (including those invested enough to potentially be reading this article) will not continue after matriculating; as such, their terminology and perspective are optimized around the version of the game they actually play. For much the same reason, many players view preparing for difficulties above that of nationals as anything from "irrelevant" to "actively detrimental to performance".
  • The development of progressively better study resources, such as question databases, has greatly accelerated the inclusion of information into the canon. A decade ago it would have taken a while for players to notice that questions on the Mali Empire are largely the same from year to year; today, the same observation can be made in seconds by searching the answerline on QBReader. Enfranchised players sometimes interpret this lack of novelty as a sign of lower difficulty (or lower quality[2]) rather than as a natural consequence of writing a question for the expected level of knowledge of the field.

Difficulty modifiers like "+" and "-" are often used to specify these fine gradations (e.g. "regs-", "regs+", "nats-"). Comically long strings of these modifiers are common in facetious examples of this system, like "nats-++-" ("nationals minus-plus-plus-minus") to describe a set infinitesimally easier than a "nats-++" set; the jury is out on whether this would be the same as "regs+--".

Metrics

HSQBRank historically kept a set of "stat adjustments" which measured how teams performed on different packet sets: NAQT IS sets were normalized to zero, with positive numbers indicating that a set played harder and negative numbers indicating that it played easier. Its spiritual successor Groger Ranks has maintained similar numbers using its own system of assessing sets.
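As a rough illustration of how stat adjustments of this kind can be estimated, the sketch below fits the purely additive model "a team's PPB on a set equals its strength minus the set's adjustment" by alternating least squares, then shifts everything so the baseline set is exactly zero (positive adjustments mean a set played harder, matching the convention above). This is a toy, not the actual HSQBRank or Groger Ranks methodology, and the team names, set names, and PPB values are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def fit_additive(
    observations: list[tuple[str, str, float]], baseline: str, iterations: int = 200
) -> tuple[dict[str, float], dict[str, float]]:
    """Fit ppb ~ strength[team] - adjustment[set] with adjustment[baseline] == 0."""
    teams = {team for team, _, _ in observations}
    sets = {set_name for _, set_name, _ in observations}
    strength = {team: 0.0 for team in teams}
    adjustment = {set_name: 0.0 for set_name in sets}
    for _ in range(iterations):
        # Holding adjustments fixed, each team's best strength is its mean of (ppb + adjustment).
        total, count = defaultdict(float), defaultdict(int)
        for team, set_name, ppb in observations:
            total[team] += ppb + adjustment[set_name]
            count[team] += 1
        strength = {team: total[team] / count[team] for team in teams}
        # Holding strengths fixed, each set's best adjustment is its mean of (strength - ppb).
        total, count = defaultdict(float), defaultdict(int)
        for team, set_name, ppb in observations:
            total[set_name] += strength[team] - ppb
            count[set_name] += 1
        adjustment = {s: total[s] / count[s] for s in sets}
        # Only differences are identifiable: pin the baseline to zero without changing predictions.
        shift = adjustment[baseline]
        adjustment = {s: a - shift for s, a in adjustment.items()}
        strength = {team: v - shift for team, v in strength.items()}
    return strength, adjustment


observations = [
    ("A", "IS", 20.0), ("B", "IS", 15.0), ("C", "IS", 10.0),
    ("A", "HW", 17.0), ("B", "HW", 12.0),  # C did not play the housewrite
]
strength, adjustment = fit_additive(observations, baseline="IS")
print(round(adjustment["HW"], 2))  # 3.0 for this exactly additive toy data
```

Note that the housewrite's adjustment here is determined entirely by the teams that played both sets, so who shows up to each set matters a great deal.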

These metrics are often taken to be a direct calculation of the relative difficulty of sets. This impulse should be tempered:

  • Both HSQBRank aPPB and Groger Ranks bonus adjustments are derived from regression under the assumption that "adjustments between sets are purely additive"[3][4]. While empirical data have shown that this is approximately true in many cases, it holds less well when conversion is above 20 PPB.
  • Adjustments are based on the comparison of IS sets and housewrites. However, there is a huge difference in both field size and composition between the two kinds of set. For instance, in the 2022-23 season roughly the same number of schools on the "top 100" ranking published by Groger Ranks played the housewrites DART III and KICKOFF as IS-215. However, over three times as many total teams played IS-215, meaning that the fraction of top 100 schools at the housewrites was much higher (~70% vs. ~25%). Such a strong selection bias can have significant consequences on the observed performance of the field (and hence the adjustment given to the set), with some loose estimates suggesting it could totally mask the true difficulty of the set.[5]
  • These ranking systems are designed to estimate relative strength of teams and are typically judged by their ability to predict nationals performances or head-to-head match-ups.[6] While they do as well as one could hope for from the data available, none of these systems were intended to describe set difficulty, so they cannot reasonably be expected to do so with the same degree of accuracy.
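The ceiling effect behind the first caveat can be seen in a toy model. The sketch below assumes, for illustration only, that a bonus has three 10-point parts whose chance of conversion is logistic in team strength minus part difficulty, with made-up part difficulties. Making every part one step harder costs a mid-strength team far more PPB than a team already near 30, so no single additive adjustment describes both teams.

```python
import math

# Hypothetical easy / middle / hard part difficulties for a typical bonus.
PART_DIFFICULTIES = (-2.0, 0.0, 2.0)


def expected_ppb(strength: float, harder_by: float = 0.0) -> float:
    """Expected points on a three-part, 10-points-per-part bonus under a toy logistic model."""
    return sum(
        10.0 / (1.0 + math.exp(-(strength - (d + harder_by))))
        for d in PART_DIFFICULTIES
    )


def ppb_drop(strength: float, harder_by: float) -> float:
    """How many PPB a team loses when every part becomes `harder_by` steps harder."""
    return expected_ppb(strength) - expected_ppb(strength, harder_by)


for strength in (0.0, 4.0):
    print(f"strength {strength}: {expected_ppb(strength):.1f} PPB, drop {ppb_drop(strength, 1.0):.1f}")
```

With these numbers the mid-strength team (15.0 PPB) loses roughly 4.5 PPB, while the strong team (about 28.6 PPB) loses roughly 1.8.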

There has not been an attempt at "adjustment"-style ranking of college teams in recent memory. Possible reasons include smaller field sizes, fewer tournaments to compare, and a general lack of interest.

Middle School Level

At the middle school level, NAQT MS sets are considered the standard for regular difficulty. The smaller number of middle school sets means that difficulty is often pegged to high school sets.

References

  1. Some thoughts on the distribution and regular difficulty by Sen. Estes Kefauver (D-TN) » Sat Nov 13, 2010 9:09 pm
  2. The majority of high school tossups on the Mali Empire are terrible. by Vixor » Wed May 12, 2021 8:06 pm
  3. Liu, Steven. “Groger Ranks 2019-20 Methodology Changes.” Groger Ranks, November 11, 2019. https://grogerranks.files.wordpress.com/2019/11/gr_2019_methodology.pdf
  4. Morlan, Fred. “FAQ.” HSQBRank, September ?. https://hsqbrank.com/faq/
  5. Re: Concern for HS Regs+ Sets this Season by Santa Claus » Mon May 08, 2023 3:48 pm
  6. Accuracy of HSQBRank by AKKOLADE » Mon Jan 09, 2012 12:22 pm