Motivation

1. Motivation

Most recent self-driving leaderboards run a policy across a few hundred test routes, average the Driving Scores, and declare the highest number the winner. While intuitive, this naive averaging has weaknesses that psychometrics researchers pointed out many years ago. Recent LLM leaderboards have started borrowing the corrections: discarding uninformative items, weighting scores by peers’ performance, and allowing ties when the evidence does not support a clear ordering. Self-driving leaderboards face the same problems, and this section makes the case for the same corrections.

1.1. Background

For a driving policy \(s\) benchmarked on \(N\) routes with per-route scores \(X_{s,n} \in [0, 1]\), the Average Driving Score¹ (ADS) \(\bar{X}_s\) of that policy is defined as

\[\bar{X}_s \;=\; \frac{1}{N} \sum_{n=1}^{N} X_{s,n}.\]

A higher average leads to a higher ranking. Several issues arise from this formulation. First, each route is weighted uniformly, independent of how informative it is. Second, \(\bar{X}_s\) is calculated in isolation from the rest of the leaderboard, which ignores the fact that a score’s meaning depends on what others scored on the same route. The same intuition underlies grading an exam on a Gaussian curve: a 70 earns an A in a class that averaged 50, and a C in one that averaged 90.
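To make the two issues concrete, here is a minimal sketch of ADS on a small, entirely hypothetical score matrix (the policy names and scores are invented for illustration). Two policies end up with the same ADS even though one of them passes the route that most of the leaderboard fails.

```python
from statistics import mean

# Hypothetical score matrix: one row per policy, one column per route, scores in [0, 1].
scores = {
    "policy_a": [1.0, 1.0, 1.0, 0.0],  # fails only the route most others fail
    "policy_b": [1.0, 1.0, 0.0, 1.0],  # passes that route, slips on an easier one
    "policy_c": [1.0, 1.0, 1.0, 0.0],
    "policy_d": [1.0, 0.0, 1.0, 0.0],
}


def average_driving_score(route_scores):
    """ADS: the unweighted mean of a policy's per-route scores."""
    return mean(route_scores)


ads = {name: average_driving_score(row) for name, row in scores.items()}

# Per-route mean across the leaderboard: a crude view of how hard each route is.
num_routes = len(next(iter(scores.values())))
route_means = [mean(row[n] for row in scores.values()) for n in range(num_routes)]

print(ads)          # policy_a and policy_b tie
print(route_means)  # the last route is passed by only one policy in four
```

ADS never looks at `route_means`, so it cannot tell which of the two failures was the more forgivable one.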

Motivated by these issues, we propose a new leaderboard scoring scheme, in which driving policies’ abilities and routes’ difficulties are read off jointly as a posterior over the full response matrix \(X\),

\[p(\text{ability}, \text{difficulty} \mid X) \;\propto\; p(X \mid \text{ability}, \text{difficulty}) \cdot p(\text{ability}, \text{difficulty}).\]

In this scheme, abilities and difficulties are estimated together. A policy’s estimated ability rises more when it passes routes that the rest of the leaderboard fails, and falls more when it stumbles on routes that the rest passes. Similarly, a route’s estimated difficulty shifts more when a competent policy fails it or when a weak policy passes it.
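One possible instantiation of the joint estimate, sketched under assumptions that the section itself does not fix: scores are binarized to pass/fail, the likelihood is a Rasch-style logistic model with P(pass) = sigmoid(ability − difficulty), both parameter sets get zero-mean Gaussian priors, and only the posterior mode (MAP) is computed by plain gradient ascent. The function name `fit_rasch_map` and all hyper-parameter values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_rasch_map(passed, prior_std=2.0, lr=0.1, steps=2000):
    """MAP abilities and difficulties under P(pass) = sigmoid(ability - difficulty).

    passed[s][n] is 1 if policy s passed route n, else 0.
    Zero-mean Gaussian priors (an assumption) keep the estimates finite
    and pin down the origin of the scale.
    """
    num_policies, num_routes = len(passed), len(passed[0])
    ability = [0.0] * num_policies
    difficulty = [0.0] * num_routes
    for _ in range(steps):
        # Gradient of the log-prior.
        grad_a = [-a / prior_std**2 for a in ability]
        grad_d = [-d / prior_std**2 for d in difficulty]
        # Gradient of the log-likelihood: observed outcome minus predicted pass probability.
        for s in range(num_policies):
            for n in range(num_routes):
                resid = passed[s][n] - sigmoid(ability[s] - difficulty[n])
                grad_a[s] += resid
                grad_d[n] -= resid
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


# Hypothetical pass/fail matrix: rows are policies, columns are routes.
passed = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]
ability, difficulty = fit_rasch_map(passed)
print([round(a, 2) for a in ability])
print([round(d, 2) for d in difficulty])
```

The route that only one policy passes comes out as the hardest, and the route everyone passes as the easiest. Note that in this particular sketch, on a complete matrix, policies with the same pass count receive the same ability; response patterns only start to affect abilities under a richer likelihood (for example, one with per-route discrimination) or when some entries are missing.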

Because the output of equation 2 is a posterior distribution, the ranking comes with a principled uncertainty² measure and credible intervals. The latter allow for a tie when the evidence does not support a clear winner.
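A minimal sketch of how interval estimates could be turned into a ranking with ties. The tiering rule used here (a policy joins the current tier if its interval reaches the tier leader's lower bound) and the interval values are illustrative assumptions, not a rule prescribed by this section.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (lower, upper) bound on a policy's ability


def rank_with_ties(intervals: Dict[str, Interval]) -> List[List[str]]:
    """Group policies into rank tiers, best tier first.

    Policies are visited in order of interval midpoint. A policy is tied with
    the current tier when its upper bound reaches the lower bound of the
    policy that opened that tier; otherwise it opens a new tier.
    """
    ordered = sorted(intervals, key=lambda name: sum(intervals[name]) / 2, reverse=True)
    tiers: List[List[str]] = []
    leader_lower = None
    for name in ordered:
        lower, upper = intervals[name]
        if tiers and upper >= leader_lower:
            tiers[-1].append(name)
        else:
            tiers.append([name])
            leader_lower = lower
    return tiers


# Hypothetical ability intervals.
intervals = {
    "policy_a": (1.2, 2.0),
    "policy_b": (0.9, 1.6),
    "policy_c": (-0.5, 0.3),
    "policy_d": (-1.4, -0.6),
}
tiers = rank_with_ties(intervals)
print(tiers)
```

Here the top two policies share first place because their intervals overlap, while the remaining two are cleanly separated. Other tie rules, such as thresholding the posterior probability that one policy outranks another, are equally plausible.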

Estimated route difficulties also address evaluation cost: routes that sit far from any submitted policy’s ability contribute almost no ranking signal, opening up a principled path to a smaller route set.
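One way to quantify "almost no ranking signal", assuming a Rasch-style pass/fail likelihood (an assumption; this section does not fix the likelihood). Write \(\theta_s\) for the ability of policy \(s\), \(b_n\) for the difficulty of route \(n\), and \(\sigma\) for the logistic function; these symbols are introduced here for illustration. The Fisher information that route \(n\) carries about \(\theta_s\) is then

\[P(\text{pass}) = \sigma(\theta_s - b_n), \qquad I_n(\theta_s) \;=\; \sigma(\theta_s - b_n)\,\bigl(1 - \sigma(\theta_s - b_n)\bigr) \;\le\; \tfrac{1}{4}.\]

The maximum of \(1/4\) is reached when \(\theta_s = b_n\). At a gap of \(|\theta_s - b_n| = 4\) the information drops to roughly \(0.018\), about 7% of the maximum, which is why routes far from every submitted policy add little to the ranking.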

1.2. Benefits

The joint scheme yields three properties that ADS lacks:

Information-Weighted Aggregation. Weighting each route score by how much the route discriminates between policies redirects incentives: pushing from 0.9 to 1.0 on an easy route barely moves the rank, while pushing from 0.4 to 0.5 on a hard route moves it substantially. Effort flows toward the routes that actually separate the leaderboard.

Informed Benchmark Design. Benchmark design, currently driven by qualitative review of rollout videos, gains a measured target. Decisions to add, drop, or swap routes can be guided by each route’s estimated difficulty and its uncertainty, rather than by authors’ intuitions about what counts as a hard scenario.

Compute-Efficient Validation. Once route difficulties are known, a new submission only needs to be run on the routes near its expected ability. That ability is set from a prior and refined as the first outcomes come in. Routes far below it are passed almost surely, and routes far above it are failed almost surely, so their outcomes barely shift the rank.
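A minimal sketch of the route-selection step, assuming a Rasch-style pass probability and using p(1 − p) as the measure of how informative a route is at a given ability. The function names, the difficulty values, and the budget are hypothetical, and the refinement of the ability estimate after each outcome is left out.

```python
import math


def pass_probability(ability, difficulty):
    """Assumed Rasch-style probability that a policy passes a route."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def route_information(ability, difficulty):
    """Information of one pass/fail outcome: largest when the pass probability is 0.5."""
    p = pass_probability(ability, difficulty)
    return p * (1.0 - p)


def select_routes(ability_estimate, difficulties, budget):
    """Indices of the `budget` most informative routes at the current ability estimate."""
    ranked = sorted(
        range(len(difficulties)),
        key=lambda n: route_information(ability_estimate, difficulties[n]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical fitted difficulties and a prior ability guess for a new submission.
difficulties = [-3.0, -1.0, 0.2, 0.9, 3.5]
chosen = select_routes(ability_estimate=0.5, difficulties=difficulties, budget=2)
print(chosen)
```

With these numbers the two routes closest to the ability estimate are selected, and the very easy and very hard routes are skipped. In a full loop, the ability estimate would be updated after each batch of outcomes and the selection repeated.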

1.3. Drawbacks

While the new ranking scheme offers many advantages over ADS, implementing it requires careful design. A few of the issues that might come up:

Moving Baseline. A policy’s estimated ability is read against the current leaderboard, so its score can shift even when its rollouts do not. A number cited in March can change by June, not because the policy changed, but because newer submissions re-anchored the route difficulties. Reviewers cannot reproduce a cited ranking without knowing the leaderboard snapshot it came from. Publishing versioned snapshots restores reproducibility at the cost of more complexity.

Adversarial Attacks. A policy’s ability depends not only on its own score, but also on others’ performance. An attacker can submit weak decoy policies that fail on a chosen target route. The fit then treats that route as harder than it is, and the attacker’s primary submission, which passes the route, is rewarded with a higher ability estimate. ADS has no analogue, since it never compares across rows. Submission limits and per-team identity verification mitigate this attack vector, at the cost of leaderboard flexibility.
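The first half of this attack, inflating a route's apparent difficulty, can be illustrated with a deliberately crude stand-in for the fitted difficulty: the smoothed log-odds of failure on the route. This proxy, the function name, and the outcome lists are assumptions made for illustration; the actual posterior would respond differently in magnitude, and the knock-on effect on ability estimates depends on the likelihood used.

```python
import math


def difficulty_proxy(route_column):
    """Smoothed log-odds of failure on one route; a crude stand-in for a fitted difficulty."""
    fails = sum(1 for passed in route_column if not passed)
    passes = len(route_column) - fails
    return math.log((fails + 0.5) / (passes + 0.5))


# Hypothetical outcomes on the target route: 1 = pass, 0 = fail.
honest = [1, 1, 1, 0, 1, 1]  # five of six honest policies pass
decoys = [0, 0, 0, 0]        # four decoy submissions built to fail it

before = difficulty_proxy(honest)
after = difficulty_proxy(honest + decoys)
print(round(before, 2), round(after, 2))
```

Four decoys are enough to move the route from clearly easy (negative log-odds of failure) to even odds, without any honest policy changing its behavior.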

Estimator Sensitivity. Given a score matrix, ADS can be reproduced exactly. The posterior in equation 2, by contrast, depends on many hyper-parameters, including the random seed, and the resulting ranking can shift when they change. To complicate matters, there is no held-out ground-truth ranking to settle disagreements between hyper-parameter choices.
