11. Extension: Multi-Dimensional Skills

Driving is not a single skill: a driving policy that is excellent at lane-keeping may still crash at every intersection, and a unidimensional ability cannot represent that. This section points to several ways to model multi-dimensional skills and to construct a leaderboard out of vector-valued abilities.

11.1. Overview

Every model up to this point compresses driving policy ability into a single scalar \(\theta_s \in \mathbb{R}\). Driving is multi-skill in the obvious way: lane-keeping, intersection negotiation, cut-in handling, low-speed maneuvering, and lane-changing live on largely independent axes, and a driving policy can be strong on one and weak on another. The Bench2Drive closed-loop benchmark already runs on this premise, scoring driving policies separately across 44 interactive scenario categories and compressing the numbers into 5 different driving skills [2], though the categorization requires manual annotation.

Multidimensional IRT (MIRT) [3] addresses this by lifting the latents from scalars to vectors. Two questions follow: how to model \(D > 1\) skills, and, given vector abilities, how to produce a leaderboard at all. There is no canonical total order on \(\mathbb{R}^D\), so the second question is not just a presentation choice.

11.2. Compensatory MIRT

The smallest change is to let \(\theta_s\) become \(\vec\theta_s \in \mathbb{R}^D\), the per-item discrimination \(\alpha_n\) become a non-negative vector \(\vec\alpha_n \in \mathbb{R}^D_+\), and the per-item difficulty \(\beta_n\) become a vector \(\vec\beta_n \in \mathbb{R}^D\). The 2PL link generalises to

\[P(X_{s,n} = 1 \mid \vec\theta_s, \vec\beta_n, \vec\alpha_n) = \sigma\!\bigl(\vec\alpha_n^\top (\vec\theta_s - \vec\beta_n)\bigr).\]

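Strength on one skill compensates for weakness on another. A driving policy with high lane-keeping ability and low intersection ability still gets credit on an intersection route through whatever lane-keeping component the route exercises. Note that \(\vec\beta_n\) enters the likelihood only through the scalar \(\vec\alpha_n^\top \vec\beta_n\), so only that combined threshold is identified, not the individual components of \(\vec\beta_n\). The Beta-IRT generalisation follows the same pattern.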
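The compensation effect can be checked numerically. The following is a minimal sketch, assuming a two-skill setup in which axis 0 is lane-keeping and axis 1 is intersection handling; the function name `compensatory_prob` and all numeric values are illustrative and not taken from any fitted model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compensatory_prob(theta, alpha, beta):
    """Compensatory 2PL link: sigmoid(alpha^T (theta - beta)) for one subject, one item."""
    return sigmoid(alpha @ (theta - beta))

# Hypothetical values: axis 0 = lane-keeping, axis 1 = intersections.
alpha = np.array([0.5, 1.0])        # an intersection route that also exercises lane-keeping
beta = np.array([0.0, 0.0])

theta_strong_lane = np.array([2.0, -1.0])  # strong lane-keeping, weak intersections
theta_avg_lane = np.array([0.0, -1.0])     # average lane-keeping, same weak intersections

p_strong_lane = compensatory_prob(theta_strong_lane, alpha, beta)
p_avg_lane = compensatory_prob(theta_avg_lane, alpha, beta)

# A different difficulty vector with the same alpha^T beta gives the same probability,
# so the individual components of beta cannot be told apart from the data.
beta_alt = np.array([2.0, -1.0])    # alpha @ beta_alt == alpha @ beta == 0
p_alt = compensatory_prob(theta_strong_lane, alpha, beta_alt)

print(p_strong_lane, p_avg_lane, p_alt)
```

In this toy setting the lane-keeping surplus lifts the pass probability on the intersection route from about 0.27 to 0.5, which is exactly the compensation the model allows.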

11.3. Non-Compensatory MIRT

Compensatory MIRT lets a high score on one axis make up for a low score on another. For safety-critical driving, that does not match reality: a driving policy that is perfect at lane-keeping but cannot navigate intersections will still crash at every intersection, no matter how good the lane-keeping is. The relevant logical operation between skills is closer to AND than to a weighted sum.

Non-compensatory MIRT enforces the AND geometry by multiplying per-skill pass probabilities [3, 4]:

\[P(X_{s,n} = 1 \mid \vec\theta_s, \vec\beta_n, \vec\alpha_n) = \prod_{d=1}^D \sigma\!\bigl(\alpha_{n,d}(\theta_{s,d} - \beta_{n,d})\bigr).\]

Each route now carries a per-skill difficulty \(\beta_{n,d}\), and the route is passed only when every required skill clears its own threshold. Unlike in the compensatory form, every component of \(\vec\beta_n\) enters the likelihood independently and is identified separately, so all \(D\) thresholds per route must be estimated, which is harder when the per-route subject count is small.
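To make the AND geometry concrete, here is a small sketch that evaluates the product link for two hypothetical policies whose abilities sum to the same total. The function name `noncompensatory_prob` and the numbers are assumptions chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def noncompensatory_prob(theta, alpha, beta):
    """Product of per-skill pass probabilities for one subject and one item."""
    return float(np.prod(sigmoid(alpha * (theta - beta))))

alpha = np.array([1.0, 1.0])
beta = np.array([0.0, 0.0])

theta_unbalanced = np.array([4.0, -2.0])  # excellent on skill 0, poor on skill 1
theta_balanced = np.array([1.0, 1.0])     # moderate on both; same ability total

p_unbalanced = noncompensatory_prob(theta_unbalanced, alpha, beta)
p_balanced = noncompensatory_prob(theta_balanced, alpha, beta)

# Under the compensatory link both policies would score identically,
# because alpha^T (theta - beta) equals 2 in both cases.
comp_unbalanced = sigmoid(alpha @ (theta_unbalanced - beta))
comp_balanced = sigmoid(alpha @ (theta_balanced - beta))

print(p_unbalanced, p_balanced, comp_unbalanced, comp_balanced)
```

The product can never exceed its smallest factor, so the weakest required skill caps the pass probability regardless of how strong the other skills are.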

11.4. How Many Skills?

Both compensatory and non-compensatory MIRT need a choice of \(D\) before anything else. Pick \(D\) too small and the missed skills get squashed into the one axis we kept, blurring everyone’s score. Pick \(D\) too large and the extra axes are interchangeable rotations of each other with no honest interpretation. Two diagnostics on the data we already have can narrow \(D\) down before fitting any MIRT, plus a third predictive check that runs once an MIRT is fitted.

11.4.1. Residual Correlation Eigenvalues

Fit a one-skill model. For every (subject, item) cell, the model predicts a pass probability \(\hat{p}_{s,n}\), and the residual \(r_{s,n} = X_{s,n} - \hat{p}_{s,n}\) is how wrong that prediction was. If one skill really explains everything, residuals are just noise: each item’s wrongness has nothing to do with any other item’s wrongness. But if there is a second skill we missed (say “intersection-handling”), the items that exercise it will have correlated wrongnesses: subjects strong on intersections overshoot together, weak ones undershoot together. Eigendecomposing the item-by-item correlation of residuals finds the strongest such co-variation patterns and ranks them. The first eigenvalue \(\lambda_1\) is the strength of the strongest pattern, \(\lambda_2\) the next, and so on. Eigenvalues clearly above the noise floor are missed skills.
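The diagnostic can be sketched on synthetic data. This example assumes the one-skill predictions \(\hat{p}_{s,n}\) are already available; to stay self-contained it uses the true one-skill probabilities as a stand-in for a fitted model, which a real analysis would replace with the fitted values. The helper `residual_eigenvalues`, the sample sizes, and the loading of 1.5 on the hidden skill are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_eigenvalues(X, p_hat):
    """Eigenvalues (descending) of the item-by-item correlation of residuals."""
    residuals = X - p_hat                         # shape (subjects, items)
    corr = np.corrcoef(residuals, rowvar=False)   # items are columns
    return np.linalg.eigvalsh(corr)[::-1]

rng = np.random.default_rng(0)
S, N = 500, 20                                    # subjects, items
theta_main = rng.normal(size=(S, 1))              # the skill the one-skill model captures
theta_extra = rng.normal(size=(S, 1))             # a second skill the model misses
beta = rng.uniform(-1.0, 1.0, size=(1, N))
loads_extra = np.zeros((1, N))
loads_extra[0, :10] = 1.5                         # the first 10 items exercise the second skill

# Stand-in for fitted one-skill predictions (assumption: a perfect unidimensional fit).
p_one_skill = sigmoid(theta_main - beta)

X_one = rng.binomial(1, p_one_skill)              # data that really is one-skill
X_two = rng.binomial(1, sigmoid(theta_main - beta + theta_extra * loads_extra))

eig_one = residual_eigenvalues(X_one, p_one_skill)
eig_two = residual_eigenvalues(X_two, p_one_skill)

print("one-skill data, top eigenvalues:", np.round(eig_one[:3], 2))
print("two-skill data, top eigenvalues:", np.round(eig_two[:3], 2))
```

When the data really are one-skill, the leading eigenvalue stays close to the rest of the spectrum; when a second skill drives ten of the items, the leading eigenvalue separates clearly.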

11.4.2. Parallel Analysis

We just said “above the noise floor” without saying how big the noise floor is. Parallel analysis measures it directly. Take the real \(X\), destroy any real cross-item structure by shuffling, compute the eigenvalues on the shuffled data, and repeat the shuffle many times. The shuffled eigenvalues are what the spectrum looks like when there is nothing real to find. Anything in the real data that beats the shuffled envelope is a real factor; anything below is noise [5].

The shuffle has to break correlations between items while leaving each item’s marginal alone. The way to do that is to permute each column of \(X\) independently. Each item keeps its original pass rate (easy items stay easy, hard items stay hard), but the link between subject \(s\)’s row and item \(n\)’s column is randomised, killing any pattern that connected items to one another. Reordering whole rows would do the opposite: it only relabels subjects and preserves all cross-item structure, so the row-shuffled eigenvalues would be identical to the real ones and the diagnostic would say nothing.
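A permutation-based parallel analysis along these lines might look as follows. This is a sketch under stated assumptions: the function `parallel_analysis`, the 95% envelope quantile, the number of shuffles, and the synthetic two-factor data are all illustrative choices, not part of the method as specified above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corr_eigenvalues(X):
    """Eigenvalues (descending) of the item-by-item correlation matrix."""
    return np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

def parallel_analysis(X, n_shuffles=200, quantile=0.95, seed=0):
    rng = np.random.default_rng(seed)
    real = corr_eigenvalues(X)
    null = np.empty((n_shuffles, X.shape[1]))
    for b in range(n_shuffles):
        # Permute each column independently: item marginals are preserved,
        # cross-item structure is destroyed.
        X_perm = np.column_stack(
            [rng.permutation(X[:, n]) for n in range(X.shape[1])]
        )
        null[b] = corr_eigenvalues(X_perm)
    envelope = np.quantile(null, quantile, axis=0)
    exceeds = real > envelope
    # Count the leading eigenvalues that beat the shuffled envelope.
    n_factors = len(real) if exceeds.all() else int(np.argmin(exceeds))
    return real, envelope, n_factors

# Synthetic check: 12 items, two independent skills, six items each.
rng = np.random.default_rng(1)
S = 400
skill_a = rng.normal(size=(S, 1))
skill_b = rng.normal(size=(S, 1))
logits = np.hstack([np.repeat(2.0 * skill_a, 6, axis=1),
                    np.repeat(2.0 * skill_b, 6, axis=1)])
X = rng.binomial(1, sigmoid(logits))

real, envelope, n_factors = parallel_analysis(X)
print("real eigenvalues:", np.round(real[:4], 2))
print("shuffled envelope:", np.round(envelope[:4], 2))
print("factors above the envelope:", n_factors)
```

Counting only the leading run of eigenvalues above the envelope is one common convention; other stopping rules are possible.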

11.4.3. Reading the Two Together

Residual eigenvalues diagnose what the unidimensional fit missed. Parallel analysis diagnoses what is in the raw data before any model is fit. They should agree on the dominant missed factors. They diverge when the unidimensional fit is partly absorbing one axis (a 2PL discrimination, for instance, can soak up a slope-like factor that parallel analysis on raw \(X\) still flags as a separate eigenvalue). When that happens, take parallel analysis on continuous \(X\) as the upper bound on \(D\), and the residual diagnostic on the most flexible unidimensional baseline (Beta-1PL) as the lower bound.

11.5. Ranking Vector Abilities

With \(\vec\theta_s \in \mathbb{R}^D\), no total order is canonical. The leaderboard has to declare what it means by “better”, and that declaration is itself part of the metric. Five options the leaderboard can pick from:

Per-skill leaderboards. Publish \(D\) separate rankings, one per axis. The most transparent option: a driving policy that wins on intersections but loses on highways shows up where it should on each list. The cost is the absence of a headline number.

Pareto dominance. \(\vec\theta_s\) Pareto-dominates \(\vec\theta_{s'}\) iff \(\theta_{s,d} \geq \theta_{s',d}\) for all \(d\) and strictly so for at least one. The undominated set is the Pareto front. This is the honest summary when no scalarisation is available, but it leaves many driving policies pairwise incomparable. NSGA-II-style non-dominated sorting [6] recovers a total preorder by peeling Pareto fronts in succession: front 1 is the global undominated set, front 2 is the undominated set after removing front 1, and so on.

Linear scalarisation. Pick weights \(\vec w \in \Delta^{D-1}\) and rank by \(\vec w^\top \vec\theta_s\). This reduces vectors to scalars, but the choice of \(\vec w\) is editorial rather than data-driven. Different stakeholders justify different \(\vec w\) (a city pilot weighs intersections more than a highway autopilot does), so any single linearly-scalarised leaderboard is one of many. Publishing \(\vec w\) alongside the ranking is essential. Hiding it inside a black-box aggregator imports the loss-dependence ambiguity flagged by Lin, Louis, Paddock, and Ridgeway [1] into a setting where it is harder to spot.

Lexicographic order. Rank by \(\theta_{s,1}\), break ties by \(\theta_{s,2}\), and so on. Useful when one skill is a hard prerequisite (a driving policy that fails on safety is not a valid candidate regardless of comfort). This is the limiting case of linear scalarisation in which each weight is vanishingly small relative to the one before it; no finite \(\vec w\) reproduces it exactly.

Posterior probabilistic dominance. Use the joint posterior \(p(\vec\theta_s, \vec\theta_{s'} \mid X)\) and report \(\Pr(\vec\theta_s \succeq \vec\theta_{s'} \mid X)\), where \(\succeq\) denotes Pareto dominance, computed as the fraction of joint posterior samples in which \(s\) dominates \(s'\) [7]. The leaderboard declares a tie when neither candidate dominates the other with high posterior mass; this is the same disjoint-interval logic that produces ties under a scalar ability, applied to a partial order.
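Several of the options above can be sketched in a few lines. The code below is a simple quadratic-time front-peeling routine in the spirit of non-dominated sorting (not the fast bookkeeping variant from [6]), together with linear scalarisation, lexicographic ordering, and a Monte Carlo estimate of posterior dominance. The function names, the example abilities, the weights, and the Gaussian stand-in for posterior draws are all hypothetical.

```python
import numpy as np

def dominates(a, b):
    """True if a Pareto-dominates b: >= on every axis, > on at least one."""
    return bool(np.all(a >= b) and np.any(a > b))

def pareto_fronts(theta):
    """Peel Pareto fronts in succession; returns a list of index lists."""
    remaining = list(range(len(theta)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(theta[j], theta[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def prob_dominates(draws_s, draws_t):
    """Fraction of joint draws (rows paired by draw index) where s dominates t."""
    ge = np.all(draws_s >= draws_t, axis=1)
    gt = np.any(draws_s > draws_t, axis=1)
    return float(np.mean(ge & gt))

# Hypothetical point estimates for four policies on D = 2 skills.
theta = np.array([[2.0, 2.0], [1.0, 3.0], [1.0, 1.0], [0.0, 0.0]])

fronts = pareto_fronts(theta)

# Linear scalarisation with published weights w (an editorial choice).
w = np.array([0.2, 0.8])
linear_order = np.argsort(-(theta @ w))

# Lexicographic order: axis 0 first, ties broken by axis 1 (both descending).
# np.lexsort treats the LAST key as the primary one.
lex_order = np.lexsort((-theta[:, 1], -theta[:, 0]))

# Stand-in posterior draws (assumption: independent Gaussians around the means).
rng = np.random.default_rng(0)
M = 4000
draws_a = rng.normal([1.0, 1.0], 0.3, size=(M, 2))
draws_b = rng.normal([0.0, 0.0], 0.3, size=(M, 2))
draws_c = rng.normal([1.0, 0.0], 0.3, size=(M, 2))

p_a_over_b = prob_dominates(draws_a, draws_b)   # clear separation on both axes
p_a_over_c = prob_dominates(draws_a, draws_c)   # separated on one axis only

print(fronts, linear_order, lex_order, p_a_over_b, p_a_over_c)
```

On this toy input the first two policies share front 1 because neither dominates the other, while the linear and lexicographic orders disagree on which of them ranks first, which illustrates why the ordering rule has to be declared.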

11.6. Interpretability

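Three of the five strategies above (per-skill leaderboards, linear scalarisation, and lexicographic order) presume that “axis 1 is intersection skill, axis 2 is highway skill” is a meaningful statement. By default it is not: an unconstrained compensatory MIRT is identified only up to a rotation of the latent space, so the fitted axes can be arbitrary mixtures of skills. There are two standard ways to handle this: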

Confirmatory MIRT. A binary matrix \(Q \in \{0, 1\}^{N \times D}\) encodes which skills each route exercises, set by domain knowledge: \(Q_{n,d} = 1\) if route \(n\) requires skill \(d\). The corresponding \(\alpha_{n,d}\) is forced to zero whenever \(Q_{n,d} = 0\). The construction goes back to Tatsuoka’s rule-space framework [8]. Sufficient sparsity in \(Q\) pins the rotation, and each surviving axis means what its column of \(Q\) says it means.

Exploratory MIRT. Leave the discriminations unconstrained and interpret the recovered axes post-hoc, with no guarantee they will line up with nameable skills [3].
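The rotation problem and the effect of the \(Q\) mask can be illustrated directly. The sketch below uses only the \(\vec\alpha_n^\top \vec\theta_s\) part of the compensatory logit (the difficulty term contributes a scalar intercept that a rotation leaves unchanged); the matrix `Q`, the rotation angle, and the random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N, D = 5, 4, 2
theta = rng.normal(size=(S, D))                # subject abilities
alpha = rng.uniform(0.5, 1.5, size=(N, D))     # unconstrained discriminations

# Rotate the latent space by an arbitrary angle.
phi = 0.7
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])

logits = theta @ alpha.T
logits_rotated = (theta @ R.T) @ (alpha @ R.T).T   # equals theta @ alpha.T since R^T R = I

# Exploratory case: the rotated solution fits exactly as well, so the axes carry no labels.
same_fit = np.allclose(logits, logits_rotated)

# Confirmatory case: a hypothetical Q matrix says which skills each route exercises.
Q = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]])
alpha_conf = alpha * Q                         # force alpha[n, d] = 0 where Q[n, d] = 0

# Rotating a masked solution breaks the zero pattern, so the rotated
# solution is no longer admissible and the axes are pinned.
alpha_conf_rotated = alpha_conf @ R.T
pattern_kept = np.allclose(alpha_conf_rotated[Q == 0], 0.0)

print(same_fit, pattern_kept)
```

With unconstrained discriminations the rotated and original parameters are indistinguishable from the data, whereas the zero pattern imposed by `Q` survives only in the unrotated solution.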

11.7. References

[1]

R. Lin, T. A. Louis, S. M. Paddock, and G. Ridgeway. Loss function based ranking in two-stage, hierarchical models. Bayesian Analysis, 1(4):915–946, 2006.

[2]

X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2Drive: towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS Datasets and Benchmarks Track. 2024. arXiv:2406.03877.

[3]

M. D. Reckase. Multidimensional Item Response Theory. Statistics for Social and Behavioral Sciences. Springer, 2009. doi:10.1007/978-0-387-89976-3.

[4]

D. M. Bolt and V. F. Lall. Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27(6):395–414, 2003. doi:10.1177/0146621603258350.

[5]

J. L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185, 1965. doi:10.1007/BF02289447.

[6]

K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002. doi:10.1109/4235.996017.

[7]

S. Rojas-Gonzalez, J. Branke, and I. Van Nieuwenhuyse. Efficient computation of probabilistic dominance in multi-objective optimization. ACM Transactions on Evolutionary Learning and Optimization, 2021. doi:10.1145/3469801.

[8]

K. K. Tatsuoka. Rule space: an approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4):345–354, 1983. doi:10.1111/j.1745-3984.1983.tb00212.x.