8. Extension: Ranking Spread

A confidence interval (CI) on each ability is not a CI on each rank. This section surveys several ways to derive rank uncertainty from ability uncertainty.

8.1. Overview

Rank is a non-smooth order statistic, so a per-subject \(\sigma_\theta\) does not map cleanly to a per-subject rank CI. Two effects drive the gap. First, the gaps between subjects decide rank uncertainty, not the absolute \(\sigma\): two subjects \(3\sigma\) apart almost never swap, while two subjects \(0.1\sigma\) apart swap nearly half the time from one random seed to the next. Second, boundary effects pin the extremes: rank 1 cannot go higher and rank \(S\) cannot go lower, so the ends can carry large ability spread but near-zero rank spread, while crowded interior regions shuffle even at modest ability spread.
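To make the first effect concrete, here is a minimal sketch of the swap probability for one pair. It assumes independent normal abilities, so the difference has standard deviation \(\sqrt{\sigma_i^2 + \sigma_j^2}\), and it reads "\(3\sigma\) apart" as a gap measured in units of a shared per-subject \(\sigma\). Both are assumptions of this illustration, and `swap_probability` is a hypothetical helper name.

```python
from statistics import NormalDist

def swap_probability(mu_i, mu_j, sigma_i, sigma_j):
    """P(theta_j > theta_i) for independent normal abilities (assumed model)."""
    gap = mu_i - mu_j
    sd_diff = (sigma_i ** 2 + sigma_j ** 2) ** 0.5  # sd of theta_i - theta_j
    return NormalDist().cdf(-gap / sd_diff)

# Gaps expressed in units of a common per-subject sigma = 1.
for gap_in_sigmas in (3.0, 1.0, 0.1):
    p = swap_probability(gap_in_sigmas, 0.0, 1.0, 1.0)
    print(f"gap = {gap_in_sigmas} sigma -> swap probability {p:.3f}")
```

Under these assumptions the \(3\sigma\) pair swaps in under 2% of draws, while the \(0.1\sigma\) pair swaps in roughly 47% of them.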

8.2. Per-subject rank CI

This is the workhorse readout: a lower and upper rank bound per subject. The band is computed by Monte Carlo: sample \(\theta_s \sim \mathcal{N}(\mu_s, \sigma_s^2)\) for each subject, rank the subjects within each draw, and take quantiles of the sampled ranks per subject. The method is direct, and its cost scales linearly in the number of draws.
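The Monte Carlo recipe above can be sketched in a few lines of standard-library Python. The function name `rank_intervals`, the example abilities, and the simple order-statistic quantile rule are assumptions of this sketch, not part of any particular library. Subjects are drawn independently, as in the description.

```python
import random

def rank_intervals(mu, sigma, n_draws=4000, level=0.95, seed=0):
    """Per-subject rank interval from independent normal draws (rank 1 = best)."""
    rng = random.Random(seed)
    n = len(mu)
    rank_draws = [[] for _ in range(n)]
    for _ in range(n_draws):
        theta = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        order = sorted(range(n), key=lambda s: -theta[s])  # best subject first
        for rank, s in enumerate(order, start=1):
            rank_draws[s].append(rank)
    # Simple quantile rule: pick the order statistic at the floored index.
    lo_idx = int((1 - level) / 2 * (n_draws - 1))
    hi_idx = int((1 + level) / 2 * (n_draws - 1))
    intervals = []
    for draws in rank_draws:
        draws.sort()
        intervals.append((draws[lo_idx], draws[hi_idx]))
    return intervals

# Hypothetical abilities: two clear extremes and a near-tied middle pair.
mu = [2.0, 0.1, 0.0, -2.0]
sigma = [0.3, 0.3, 0.3, 0.3]
intervals = rank_intervals(mu, sigma)
print(intervals)
```

With these assumed inputs the two extreme subjects are pinned at ranks 1 and 4, while the two middle subjects share the interval (2, 3), even though all four have the same ability spread.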

Chatbot Arena reports a CI of this kind and flags pairs as “statistically tied” when their intervals overlap [1]. CI overlap is a conservative significance test: two subjects can have overlapping rank CIs and still pairwise-dominate at well above the 95% level.
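A small illustration of why overlap is conservative, using a hypothetical two-subject example (the abilities, the seed, and the helper `rank_ci` are assumptions of this sketch): the rank CIs overlap, yet one subject beats the other in well over 95% of draws.

```python
import random

# Hypothetical two-subject example: A sits 2.56 sigma above B.
mu_a, mu_b, sigma = 2.56, 0.0, 1.0
n_draws = 20000
rng = random.Random(1)

# Rank of A in each draw (1 = best); B takes the other rank.
ranks_a = [1 if rng.gauss(mu_a, sigma) > rng.gauss(mu_b, sigma) else 2
           for _ in range(n_draws)]
ranks_b = [3 - r for r in ranks_a]

def rank_ci(ranks, level=0.95):
    """Central interval of sampled ranks via floored order-statistic indices."""
    ordered = sorted(ranks)
    lo = ordered[int((1 - level) / 2 * (len(ordered) - 1))]
    hi = ordered[int((1 + level) / 2 * (len(ordered) - 1))]
    return lo, hi

ci_a, ci_b = rank_ci(ranks_a), rank_ci(ranks_b)
p_dominance = ranks_a.count(1) / n_draws  # estimate of P(theta_A > theta_B)
overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
print(ci_a, ci_b, overlap, round(p_dominance, 3))
```

Here A is ranked second in a little more than 2.5% of draws, which is enough to stretch both 95% rank intervals to (1, 2), while the pairwise dominance estimate stays at roughly 0.96.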

8.3. Top-\(k\) inclusion probability

When the claim is a cut-point (“best policy”, “top-3”, “pass / fail”), report the probability that subject \(s\) falls inside the cut, \(P(\text{rank}_s \le k)\). Under \(0/1\) loss, the Bayes decision is to include \(s\) iff this probability exceeds \(0.5\) [3]. The plausible top-\(k\) set can differ from the \(k\) subjects with the highest point ability.
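The inclusion probability can be estimated from the same kind of draws used for rank CIs. The sketch below assumes independent normal abilities; `topk_inclusion` and the example numbers are hypothetical. It applies the \(0.5\) threshold to three near-tied subjects with \(k = 1\).

```python
import random

def topk_inclusion(mu, sigma, k, n_draws=5000, seed=0):
    """Monte Carlo estimate of P(rank_s <= k) for each subject s."""
    rng = random.Random(seed)
    n = len(mu)
    hits = [0] * n
    for _ in range(n_draws):
        theta = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        top = sorted(range(n), key=lambda s: -theta[s])[:k]  # k best in this draw
        for s in top:
            hits[s] += 1
    return [h / n_draws for h in hits]

# Hypothetical near-tie: subject 0 leads on point ability by a small margin.
probs = topk_inclusion([0.1, 0.0, 0.0], [1.0, 1.0, 1.0], k=1)
bayes_set = [s for s, p in enumerate(probs) if p > 0.5]
print(probs, bayes_set)

# A clearly separated leader, for contrast.
clear = topk_inclusion([3.0, 0.0, 0.0], [1.0, 1.0, 1.0], k=1)
print(clear)
```

In the near-tied case no subject clears \(0.5\), so the thresholded set is empty even though subject 0 has the highest point ability; in the separated case the leader is included with high probability.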

8.4. Pairwise dominance probability

The cleanest answer to “does \(i\) beat \(j\)?” goes through the abilities directly, \(P(\theta_i > \theta_j)\), with Bonferroni or Benjamini-Hochberg correction over the \(S(S-1)/2\) pairs. The full matrix is sharper than rank-CI overlap and is what Arena renders as its “A significantly better than B” arrows [1, 4].
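A minimal sketch of the corrected pairwise readout, under two assumptions that are this example's own: abilities are independent normals, so \(P(\theta_i > \theta_j) = \Phi\big((\mu_i - \mu_j)/\sqrt{\sigma_i^2 + \sigma_j^2}\big)\), and \(1 - P(\theta_i > \theta_j)\) for the higher-mean subject is treated as a one-sided p-value analogue when applying the corrections. The function names and input values are hypothetical.

```python
from itertools import combinations
from statistics import NormalDist

def dominance(mu, sigma, i, j):
    """P(theta_i > theta_j) under independent normal abilities (assumed model)."""
    sd_diff = (sigma[i] ** 2 + sigma[j] ** 2) ** 0.5
    return NormalDist().cdf((mu[i] - mu[j]) / sd_diff)

def significant_pairs(mu, sigma, alpha=0.05, method="bonferroni"):
    """Return (better, worse) pairs that survive the chosen correction."""
    pairs = []
    for i, j in combinations(range(len(mu)), 2):
        hi, lo = (i, j) if mu[i] >= mu[j] else (j, i)
        pairs.append((1.0 - dominance(mu, sigma, hi, lo), hi, lo))
    pairs.sort()  # ascending p-value analogue
    m = len(pairs)  # S(S-1)/2 comparisons
    if method == "bonferroni":
        n_keep = sum(q <= alpha / m for q, _, _ in pairs)
    else:  # Benjamini-Hochberg step-up: largest r with q_(r) <= r * alpha / m
        n_keep = max((r for r, (q, _, _) in enumerate(pairs, start=1)
                      if q <= r * alpha / m), default=0)
    return [(hi, lo) for _, hi, lo in pairs[:n_keep]]

mu = [2.0, 1.0, 0.9, 0.0]
sigma = [0.3, 0.3, 0.3, 0.3]
bonf = significant_pairs(mu, sigma, method="bonferroni")
bh = significant_pairs(mu, sigma, method="bh")
print(bonf)
print(bh)
```

On these assumed inputs Bonferroni keeps only the two widest gaps, while Benjamini-Hochberg keeps every pair except the near-tied subjects 1 and 2, which shows how much the choice of correction can matter.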

8.5. Partial-order diagram for crowded regions

When a region is too crowded for any total order to be credible, threshold the pairwise matrix at \(1 - \alpha\) and draw the resulting partial order as a Hasse diagram [2, 5]. Clustered subjects appear as incomparable nodes, which is the honest statement that the data do not support a total order over them.
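The thresholding step and the edges of the resulting Hasse diagram can be sketched as follows. The sketch assumes independent normal abilities (under which the thresholded relation is transitive), applies the \(1 - \alpha\) cut without any multiplicity correction, and uses a hypothetical helper name, `hasse_edges`. Drawing the diagram itself is left out.

```python
from statistics import NormalDist

def hasse_edges(mu, sigma, alpha=0.05):
    """Cover relations and incomparable pairs of the thresholded dominance order."""
    n = len(mu)

    def dominates(i, j):
        sd_diff = (sigma[i] ** 2 + sigma[j] ** 2) ** 0.5
        return NormalDist().cdf((mu[i] - mu[j]) / sd_diff) >= 1 - alpha

    # Full relation: (i, j) means "i credibly above j".
    above = {(i, j) for i in range(n) for j in range(n)
             if i != j and dominates(i, j)}
    # Hasse edges: drop (i, j) when some k sits strictly between i and j.
    covers = sorted((i, j) for (i, j) in above
                    if not any((i, k) in above and (k, j) in above
                               for k in range(n)))
    incomparable = sorted((i, j) for i in range(n) for j in range(i + 1, n)
                          if (i, j) not in above and (j, i) not in above)
    return covers, incomparable

mu = [2.0, 1.0, 0.9, 0.0]
sigma = [0.3, 0.3, 0.3, 0.3]
covers, incomparable = hasse_edges(mu, sigma)
print(covers)
print(incomparable)
```

For these assumed inputs the diagram is a diamond: subject 0 sits above subjects 1 and 2, both of which sit above subject 3, and subjects 1 and 2 are left incomparable.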

8.6. References

[1]

W.-L. Chiang and others. Chatbot Arena: an open platform for evaluating LLMs by human preference. In ICML. 2024. arXiv:2403.04132.

[2]

H. Goldstein and D. J. Spiegelhalter. League tables and their limitations: statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159(3):385–409, 1996.

[3]

R. Lin, T. A. Louis, S. M. Paddock, and G. Ridgeway. Loss function based ranking in two-stage, hierarchical models. Bayesian Analysis, 1(4):915–946, 2006.

[4]

A. R. Avelar Menendez, Y. Liu, and X. Dai. Prompt-dependent ranking of large language models with uncertainty quantification. arXiv preprint arXiv:2603.03336, 2026.

[5]

A. F. Barrientos, D. Sen, G. L. Page, and D. B. Dunson. Bayesian inferences on uncertain ranks and orderings: application to ranking players and lineups. Bayesian Analysis, 2023. arXiv:1907.04842.