3. Meta Analysis: Ranking

The psychometrics literature has spent close to a century on how to measure the latent abilities of contestants. This section walks through the history, filters for methods that fit our use case, and points to Item Response Theory as the candidate worth carrying forward.

3.1. Overview

Joint ranking targets two linked problems on the same data. Given many subjects (driving policies) evaluated on many items (routes), we want to reliably order the subjects by ability and the items by difficulty. As discussed in the motivation, naive averaging of the response matrix has well-known drawbacks. The fix is to model subjects and items jointly, and the joint-ranking idea is older and broader than self-driving.

Across fields, each generation picked a different slice of the design space:

  • Response type: Pairwise match outcomes (chess, LLM arenas) versus per-item continuous scores (educational tests).

  • Update mode: Online ratings that refresh after every game (Elo, TrueSkill) versus offline batch fits over a fixed matrix (IRT, Bradley-Terry).

  • Item modelling: Subjects only (chess) versus joint subject-and-item (educational tests).

Two branches dominate the history, sitting at opposite corners of those axes:

  1. Educational testing picked offline, joint, and per-item. Thurstone [2] and Rasch [3] had to compare students who answered different items on a fixed test with no head-to-head matches, and identifiability of ability versus item difficulty became the foundational question. The joint subject-and-item structure they built is the direct ancestor of every IRT model used today.

  2. Chess and competitive sport picked the opposite slice: outcomes are pairwise, subject-only, and arrive one game at a time. Arpad Elo’s 1960 chess rating set the template, and almost every later online-rating scheme has been a variation on it, including Microsoft’s TrueSkill for online matchmaking and FIFA’s football rankings.

Machine learning has revisited both branches. The 2010s reused the educational slice for ML benchmarks, with models in the role of students [4]. The LLM arenas of the 2020s swung back toward the chess slice through Chatbot Arena’s Bradley-Terry fit on millions of pairwise crowd-sourced votes [5].

Self-driving evaluation lands firmly on the educational side: continuous per-route scores, a fixed offline matrix, joint subject-and-item modelling, and no head-to-head match stream between driving policies. Those constraints sharply reduce the number of relevant methods. The entire pairwise branch (Thurstone’s 1927 Gaussian percepts [2], Bradley-Terry’s logistic match outcomes [6], all the way up to Chatbot Arena’s Bradley-Terry rankings [5]) drops out. The multiple-choice IRT extensions, 3PL with a guessing floor and 4PL with a slipping ceiling, drop out too, since their latents have no analogue when the response is a continuous safety score on a route. What we are left with is a small family of joint subject-and-item models:

| Year and source | Method | Short abstract |
| --- | --- | --- |
| 1960. Rasch [3] | 1PL | Joint subject-and-item logistic model with a single difficulty \(b_n\) per item and a single ability \(\theta_s\) per subject. Ability and difficulty live on the same scale, the specific objectivity property Rasch built the model around: comparisons between subjects should not depend on which items were used. |
| 1968. Birnbaum [7] | 2PL | Adds a per-item discrimination \(\alpha_n\) that scales the ability-difficulty gap before the logistic. Different items separate strong from weak subjects with different sharpness. The 2PL chapter in Lord and Novick’s 1968 volume became the founding text of modern psychometrics. |
| 2007. Noel and Dauvier [8] | Beta-IRT | Drops 1PL’s Bernoulli for a Beta likelihood so the response can stay continuous in \((0, 1)\) instead of being binarised at a hand-picked threshold. |
| 2019-26. ML / LLM evaluation [1, 4, 9, 10] | IRT at scale | Ports IRT to ML classifier and LLM benchmarks (models as subjects, benchmark items as items). Brings amortised and variational fitting through py-irt [11]. |
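The response functions summarised in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a fitting routine: the function names are hypothetical, and the Beta likelihood uses a mean-precision parameterisation with a fixed `precision` value chosen for readability, which is an assumption and not necessarily the exact form used in [8].

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def p_1pl(theta: float, b: float) -> float:
    """1PL (Rasch): success probability from the ability-difficulty gap."""
    return logistic(theta - b)


def p_2pl(theta: float, b: float, alpha: float) -> float:
    """2PL: discrimination alpha scales the gap before the logistic."""
    return logistic(alpha * (theta - b))


def beta_loglik(y: float, theta: float, b: float, precision: float = 10.0) -> float:
    """Log-density of a continuous score y in (0, 1) under a Beta likelihood.

    Assumption: mean-precision form with mean = logistic(theta - b).
    `precision` is a hypothetical fixed concentration, not taken from [8].
    """
    mean = logistic(theta - b)
    a = mean * precision          # first Beta shape parameter
    c = (1.0 - mean) * precision  # second Beta shape parameter
    log_norm = math.lgamma(a + c) - math.lgamma(a) - math.lgamma(c)
    return log_norm + (a - 1.0) * math.log(y) + (c - 1.0) * math.log(1.0 - y)


if __name__ == "__main__":
    # A strong policy (theta = 2) on an average route (b = 0).
    print(p_1pl(2.0, 0.0), p_2pl(2.0, 0.0, alpha=0.5))
    # A high observed score is more likely than a low one for that pairing.
    print(beta_loglik(0.9, 2.0, 0.0), beta_loglik(0.1, 2.0, 0.0))
```

The Beta variant consumes the raw score directly, which is the point of the third table row: no pass/fail threshold has to be chosen before fitting.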

3.2. Research Insights

A few cautionary findings from neighbouring fields that any leaderboard designer should keep in mind, regardless of which model is used.

Leaderboard fragility. Ranked tables look more authoritative than they really are. Goldstein and Spiegelhalter [12], working on hospital and school league tables in the 1990s, showed that many adjacent ranks were statistically indistinguishable, and small sample sizes could push the published order far from the true one. The same holds for driving policies: a leaderboard reported without per-rank uncertainty invites over-interpretation, regardless of which model produced it.
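One way to attach per-rank uncertainty to a leaderboard is a route-level bootstrap, sketched below. This is a simple technique swapped in for illustration, not the procedure used in [12]; the function names, the mean-score ranking rule, and the input layout (one list of per-route scores per policy, routes in the same order for every policy) are all assumptions.

```python
import random


def rank_of(scores: dict) -> dict:
    """Map each name to its rank; rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def bootstrap_rank_intervals(matrix, n_boot=1000, level=0.95, seed=0):
    """Percentile intervals for each policy's rank.

    matrix: {policy_name: [score_on_route_0, score_on_route_1, ...]}
    Routes are resampled with replacement, policies are re-ranked by
    mean score on each resample, and the spread of ranks is reported.
    """
    rng = random.Random(seed)
    names = list(matrix)
    n_routes = len(next(iter(matrix.values())))
    samples = {name: [] for name in names}

    for _ in range(n_boot):
        idx = [rng.randrange(n_routes) for _ in range(n_routes)]
        means = {
            name: sum(matrix[name][i] for i in idx) / n_routes for name in names
        }
        for name, rank in rank_of(means).items():
            samples[name].append(rank)

    tail = (1.0 - level) / 2.0
    intervals = {}
    for name in names:
        ordered = sorted(samples[name])
        lower = ordered[int(tail * (n_boot - 1))]
        upper = ordered[int((1.0 - tail) * (n_boot - 1))]
        intervals[name] = (lower, upper)
    return intervals


if __name__ == "__main__":
    # Hypothetical scores for three policies on five routes.
    demo = {
        "policy_a": [0.90, 0.85, 0.95, 0.80, 0.90],
        "policy_b": [0.70, 0.75, 0.60, 0.80, 0.65],
        "policy_c": [0.72, 0.70, 0.66, 0.78, 0.64],
    }
    print(bootstrap_rank_intervals(demo, n_boot=500))
```

Policies whose rank intervals overlap should be read as tied for practical purposes, which is the reading the league-table literature recommends.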

Sample-size frontier. Richer ranking models need more data, and there is a knee. Schroeders and Gnambs [13] collate simulation evidence that simple subject-only models stabilise from a few dozen subjects, while joint subject-and-item models need the low hundreds before the per-item parameters are identifiable. Past that knee, adding more routes helps less than adding more driving policies.

Multidimensionality. A single-number ranking implicitly assumes there is one ability axis. Crișan et al. [14] simulated what happens when the truth has two axes, say comfort and safety, but the model fits only one. The ranks themselves stay roughly right, but their reported uncertainty silently collapses, and the textbook fix of dropping misfitting routes often makes things worse, not better. “Everything looks fine on the leaderboard” is not evidence that a one-dimensional ability score captures what the driving policies actually differ on.

Harness sensitivity. The evaluation pipeline is part of the ranking. Alzahrani et al. [15] showed that LLM leaderboards move up to eight rank positions under purely mechanical changes to the evaluation harness: option order in multiple choice, the scoring rule, or the answer-extraction regex, none of which touch the underlying model. The driving analogue is the choice of metric formula, collision threshold, and route aggregation rule: these are part of the ranking too, not external observations, and any model fit on top inherits whatever biases those choices baked in.
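A quick way to probe this sensitivity on driving data is to rank the same score matrix under two aggregation rules and measure how far any policy moves. The sketch below is illustrative only: the policy names and scores are hypothetical, and mean versus median is just one example of a harness choice, not the perturbations studied in [15].

```python
from statistics import mean, median


def ranking(matrix, aggregate):
    """Order policy names from best to worst under an aggregation rule."""
    return sorted(matrix, key=lambda name: aggregate(matrix[name]), reverse=True)


def max_rank_shift(matrix, rule_a, rule_b):
    """Largest change in position of any policy between two rules."""
    pos_a = {name: i for i, name in enumerate(ranking(matrix, rule_a))}
    pos_b = {name: i for i, name in enumerate(ranking(matrix, rule_b))}
    return max(abs(pos_a[name] - pos_b[name]) for name in matrix)


# Hypothetical per-route scores: policy_a is excellent on three routes
# and fails badly on one, policy_b is uniformly good.
scores = {
    "policy_a": [0.95, 0.95, 0.95, 0.10],
    "policy_b": [0.80, 0.80, 0.80, 0.80],
    "policy_c": [0.60, 0.70, 0.75, 0.70],
}

if __name__ == "__main__":
    print(ranking(scores, mean))    # the mean penalises the single failure
    print(ranking(scores, median))  # the median ignores it
    print(max_rank_shift(scores, mean, median))
```

Here the top two policies swap places depending on whether the single bad route is averaged in or ignored, with no change to either policy.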

3.3. References

[1] F. M. Polo and others. tinyBenchmarks: evaluating LLMs with fewer examples. In ICML. 2024.

[2] L. L. Thurstone. A law of comparative judgment. Psychological Review, 1927.

[3] G. Rasch. Probabilistic Models for Some Intelligence and Attainment Tests. Danish Institute for Educational Research, 1960.

[4] F. Martínez-Plumed and others. Item response theory in AI: analysing machine learning classifiers at the instance level. Artificial Intelligence, 2019.

[5] W.-L. Chiang and others. Chatbot Arena: an open platform for evaluating LLMs by human preference. In ICML. 2024. arXiv:2403.04132.

[6] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs. Biometrika, 1952.

[7] A. Birnbaum. Some latent trait models. In F. M. Lord and M. R. Novick, editors, Statistical Theories of Mental Test Scores. Addison-Wesley, 1968.

[8] Y. Noel and B. Dauvier. A beta item response model for continuous bounded responses. Applied Psychological Measurement, 31(1):47–73, 2007.

[9] J. P. Lalor, H. Wu, and H. Yu. Learning latent parameters without human response patterns. In EMNLP. 2019. Code: nd-ball/py-irt.

[10] P. Rodriguez and others. Evaluation examples are not equally informative. In ACL. 2021.

[11] J. P. Lalor and P. Rodriguez. Py-irt: a scalable item response theory library for Python. arXiv preprint arXiv:2203.01282, 2022. Code: nd-ball/py-irt.

[12] H. Goldstein and D. J. Spiegelhalter. League tables and their limitations: statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159(3):385–409, 1996.

[13] U. Schroeders and T. Gnambs. Sample-size planning in item-response theory: a tutorial. Advances in Methods and Practices in Psychological Science, 2025.

[14] D. R. Crișan, J. N. Tendeiro, and R. R. Meijer. Investigating the practical consequences of model misfit in unidimensional IRT models. Applied Psychological Measurement, 41(6):439–455, 2017. doi:10.1177/0146621617695522.

[15] N. Alzahrani, H. A. Alyahya, Y. Alnumay, S. Alrashed, S. Alsubaie, Y. Almushayqih, F. Mirza, N. Alotaibi, N. Altwairesh, A. Alowisheq, M. S. Bari, and H. Khan. When benchmarks are targets: revealing the sensitivity of large language model leaderboards. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). 2024. arXiv:2402.01781.