Leaderboard Design
Authors: Long Nguyen, Kashyap Chitta
Updated: May 31, 2026
Preface
This documentation is a technical appendix to the development of a new leaderboard for the NavHard benchmark and the Alpasim competition. It provides the basics needed to understand, develop, and debug the ranking pipeline.
Two research questions form the starting point of this project:
Can we make the benchmarking of E2E driving models more efficient by keeping only the test routes that meaningfully separate policies, and how aggressively can we trim before the rankings break down?
Can we make the ranking more informative by judging each policy not only on its raw Driving Score but also on how its peers fared on the same routes?
This project is still a work in progress.
Structure of this documentation
This document is organized in three parts. The first part is a meta analysis of the two questions above: Section 2 surveys efficient evaluation strategies, and Section 3 surveys ranking methods. Together they motivate Item Response Theory (IRT) as a unified framework for both. The second part, Sections 4-7, develops the basic IRT machinery. The third part, Sections 8-11, extends the basic models to address shortcomings specific to driving benchmarks.
Most recent self-driving leaderboards run a policy across a few hundred test routes, average the Driving Scores, and pick the highest number as the winner. While intuitive, this naive averaging has issues, as psychometrics researchers pointed out many years ago. Recent LLM leaderboards have started borrowing the corrections: discarding uninformative items, weighting scores by considering peers’ performance, and allowing ties when the evidence does not support a clear ordering. Self-driving leaderboards face the same problems, and this section makes the case for the same corrections.
Part I: Meta Analysis
2 Meta Analysis: Efficient Evaluation.
Running every LLM across tens of thousands of benchmark items is costly. The research line of benchmark compression asks whether a smaller subset of benchmark questions can replace the full benchmark, and reports cost savings of 80 to 99 percent at a few percent error. This section walks through the two archetypal methods of benchmark compression and their open problems.
The psychometrics literature has spent close to a century on how to measure the latent abilities of contestants. This section walks through the history, filters for methods that fit our use case, and points to Item Response Theory as the candidate worth carrying forward.
Part II: Basic IRT
With Item Response Theory chosen as the candidate, this section examines the flavors of IRT that are relevant for self-driving. Each of them treats the leaderboard as a response matrix, assigns each driving policy an ability and each route a difficulty, and fits a posterior over the latents to the data.
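As a minimal sketch of the response-matrix view, the snippet below evaluates the likelihood of a 1PL (Rasch) model. It assumes binary pass/fail responses for simplicity, which is not necessarily the flavor used in this project; the names `rasch_prob` and `log_likelihood` and the toy data are hypothetical.

```python
import math

def rasch_prob(ability, difficulty):
    """P(policy passes route) under a 1PL (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_likelihood(responses, abilities, difficulties):
    """Bernoulli log-likelihood of a policies-by-routes 0/1 response matrix."""
    total = 0.0
    for i, row in enumerate(responses):
        for j, y in enumerate(row):
            p = rasch_prob(abilities[i], difficulties[j])
            total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Hypothetical toy data: 3 policies (rows) x 4 routes (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
abilities = [1.0, 0.0, -1.0]            # one latent ability per policy
difficulties = [-1.5, -0.5, 0.5, 1.5]   # one latent difficulty per route

# Abilities ordered consistently with the data explain it better
# than the same abilities assigned in reverse order.
ll_good = log_likelihood(responses, abilities, difficulties)
ll_flipped = log_likelihood(responses, abilities[::-1], difficulties)
```

Fitting an IRT model amounts to finding the abilities and difficulties (or a posterior over them) under which this likelihood is high.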
Virtually no IRT flavor admits a closed-form posterior, so numerical methods are needed to approximate it. The two prominent options are stochastic variational inference (SVI) and Markov chain Monte Carlo (MCMC).
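To make the MCMC option concrete, here is a minimal random-walk Metropolis sketch. It samples the ability of a single policy under a Rasch likelihood with a standard normal prior, assuming binary responses and known route difficulties. These are simplifying assumptions for illustration: a real pipeline would sample all latents jointly, typically with a probabilistic programming library. The function names and toy data are hypothetical.

```python
import math
import random

def log_posterior(theta, responses, difficulties):
    """Unnormalized log posterior: N(0, 1) prior plus Rasch likelihood."""
    lp = -0.5 * theta * theta
    for y, b in zip(responses, difficulties):
        z = theta - b
        lp += y * z - math.log1p(math.exp(z))
    return lp

def metropolis(responses, difficulties, n_steps=5000, step=0.8, seed=0):
    """Random-walk Metropolis over one ability; returns post-burn-in draws."""
    rng = random.Random(seed)
    theta = 0.0
    lp = log_posterior(theta, responses, difficulties)
    draws = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        lp_proposal = log_posterior(proposal, responses, difficulties)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_proposal - lp)):
            theta, lp = proposal, lp_proposal
        draws.append(theta)
    return draws[n_steps // 5:]  # discard the first 20% as burn-in

# Hypothetical toy data: 8 routes of average difficulty.
difficulties = [0.0] * 8
strong_draws = metropolis([1] * 8, difficulties)   # passes every route
weak_draws = metropolis([0] * 8, difficulties)     # fails every route
strong_mean = sum(strong_draws) / len(strong_draws)
weak_mean = sum(weak_draws) / len(weak_draws)
```

SVI would instead posit a parametric family (for example one Gaussian per latent) and optimize its parameters, trading exactness for speed.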
6 IRT: Uncertainty Quantification.
Posterior estimation comes with uncertainty for free: the variational \(\sigma\) of the SVI guides, or the percentile intervals of MCMC draws. This section covers further recipes that are orthogonal or complementary to the posterior uncertainty: (1) non-parametric and parametric bootstraps target the data-side question (what if the data had been different); (2) seed variance targets the training side (how robust are our estimator and implementation).
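The non-parametric bootstrap can be sketched as follows: resample routes with replacement, recompute the estimate on each resample, and read off percentile bounds. In this sketch the estimator is a plain mean Driving Score as a stand-in; in the actual pipeline it would presumably be a refit of the ranking model. The names `bootstrap_ci` and `mean_score` and the toy scores are hypothetical.

```python
import random

def bootstrap_ci(route_scores, estimator, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval from resampling routes with replacement."""
    rng = random.Random(seed)
    n = len(route_scores)
    stats = []
    for _ in range(n_boot):
        resample = [route_scores[rng.randrange(n)] for _ in range(n)]
        stats.append(estimator(resample))
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lower, upper

def mean_score(scores):
    """Stand-in estimator (assumption): the naive average Driving Score."""
    return sum(scores) / len(scores)

# Hypothetical per-route scores of one policy, each in [0, 1].
route_scores = [0.0, 0.2, 0.9, 1.0, 0.7, 0.5, 0.8, 1.0, 0.0, 0.6]
lower, upper = bootstrap_ci(route_scores, mean_score)
```

A seed-variance check follows the same pattern, except that the data stay fixed and only the random seed of the fitting procedure changes between runs.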
Validating a ranking model is hard. An image classifier can be scored on a held-out test set. A ranking model has no such anchor: no external true ordering exists to compare against. This section lays out three diagnostics that work around that gap without relying on human preference.
Part III: Extensions
A CI on each ability is not a CI on each rank. This section points to several options for measuring rank uncertainty from ability uncertainty.
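One simple option, sketched here under the assumption that joint posterior draws of all abilities are available, is to rank the policies within each draw and tabulate how often each policy lands on each rank. The function name and the simulated draws are hypothetical.

```python
import random

def rank_frequencies(draws):
    """draws: list of joint posterior draws, each a list with one ability per policy.
    Returns freq[p][r], the fraction of draws in which policy p has rank r (0 = best)."""
    n_policies = len(draws[0])
    counts = [[0] * n_policies for _ in range(n_policies)]
    for draw in draws:
        order = sorted(range(n_policies), key=lambda p: draw[p], reverse=True)
        for rank, policy in enumerate(order):
            counts[policy][rank] += 1
    return [[c / len(draws) for c in row] for row in counts]

# Simulated stand-in for posterior draws (assumption): two policies with
# close abilities and one clearly weaker policy.
rng = random.Random(0)
means, sd = [1.0, 0.9, -1.0], 0.3
draws = [[rng.gauss(m, sd) for m in means] for _ in range(2000)]
freq = rank_frequencies(draws)
```

In this toy setup the first two policies swap places in a large share of the draws, so neither can be declared the winner with confidence, while the third is last in almost every draw. Ranking within each joint draw matters: ranking the marginal ability intervals separately would ignore correlations between the abilities.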
9 Extension: Unbounded Response.
Basic IRT methods assume the response sits inside a bounded interval. Several AV safety metrics, however, are unbounded. Two adaptations show up in practice: the Log-normal IRT and Poisson IRT models, both of which sidestep the need for metric normalization and provide a ranking from raw driving score input.
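For orientation, one common way to write the two likelihoods is shown below. This is an illustrative parameterization and an assumption on our part, not necessarily the one adopted later in this documentation. Here \(Y_{ij}\) is the raw metric of policy \(i\) on route \(j\) (a count for the Poisson model, a positive real for the log-normal model), \(\theta_i\) is the ability, \(b_j\) is the route difficulty, and \(\sigma_j\) is a route-specific scale. The sign convention assumes that larger values of \(Y_{ij}\) are worse, as with infraction counts.

\[
Y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} = b_j - \theta_i
\]

\[
\log Y_{ij} \sim \mathcal{N}\!\left(b_j - \theta_i,\ \sigma_j^2\right)
\]

In both cases the latent parameters enter on the log scale, which is why no normalization of the raw metric to a bounded interval is needed.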
Score distributions in end-to-end self-driving carry a sharp spike at zero, the footprint of routes where the driving policy failed catastrophically, and a smaller spike at one for clean passes. Continuous flavors of IRT cannot put finite mass on a single point, so the fix is to peel each spike off into its own branch and let a vanilla IRT describe the interior. This section covers the GRM-style zero-and-one-inflated mixture, after Molenaar (2022).
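The branching idea can be sketched as a per-observation log-likelihood. The snippet below splits the outcome into three ordered categories (exactly 0, interior, exactly 1) with two thresholds, and uses a Beta density for the interior. Both the threshold parameterization and the Beta interior are assumptions made for illustration and may differ from Molenaar (2022) and from this project's implementation; all names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def branch_probs(theta, b_zero, b_one):
    """GRM-style split into three ordered categories: 0, interior, 1.
    Requires b_zero < b_one so that all three probabilities are positive."""
    p_above_zero = sigmoid(theta - b_zero)  # P(Y > 0)
    p_one = sigmoid(theta - b_one)          # P(Y = 1)
    return 1.0 - p_above_zero, p_above_zero - p_one, p_one

def beta_logpdf(y, mean, precision):
    """Log density of a Beta distribution in mean/precision form, 0 < y < 1."""
    a, b = mean * precision, (1.0 - mean) * precision
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def log_lik(y, theta, b_zero, b_one, b_interior, precision=5.0):
    """Log-likelihood of one score y in [0, 1] under the three-branch mixture."""
    p_zero, p_mid, p_one = branch_probs(theta, b_zero, b_one)
    if y == 0.0:
        return math.log(p_zero)   # point mass at zero (catastrophic failure)
    if y == 1.0:
        return math.log(p_one)    # point mass at one (clean pass)
    mean = sigmoid(theta - b_interior)
    return math.log(p_mid) + beta_logpdf(y, mean, precision)

# Hypothetical route with thresholds at -1 and 1, evaluated for two policies.
p_zero, p_mid, p_one = branch_probs(0.0, -1.0, 1.0)
ll_fail_weak = log_lik(0.0, -3.0, -1.0, 1.0, 0.0)
ll_fail_strong = log_lik(0.0, 3.0, -1.0, 1.0, 0.0)
ll_interior = log_lik(0.5, 0.0, -1.0, 1.0, 0.0)
```

Because the same ability drives all three branches, a catastrophic failure is far more likely for a weak policy than for a strong one, and the spikes contribute to the ability estimate instead of being discarded.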
11 Extension: Multi-Dimensional Skills.
Driving is not a single skill: a driving policy that is excellent at lane-keeping may still crash at every intersection, and a unidimensional ability cannot represent that. This section points to several ways to model multi-dimensional skills and to construct a leaderboard out of vector-valued abilities.