4. IRT: Mathematical Models

With Item Response Theory (IRT) chosen as the candidate, this section examines the variants of IRT that are relevant for self-driving. Each of them treats the leaderboard as a response matrix, assigns each driving policy an ability and each route a difficulty, and infers a posterior over these latents from the data.

4.1. Overview

The starting point of IRT is a response matrix \(X\) of shape \(S \times N\), with \(X_{s,n}\) the score subject \(s\) obtained on item \(n\). Every model in this section builds on that same picture, and only two axes separate them: (1) the latent parameters they model, and (2) the link function that ties those latents to the observed score. The two tables below lay out both axes side by side.

The latent parameters describe the underlying quantities that we try to fit from the response matrix:

| Symbol | Name | Per | Means | Used by |
|---|---|---|---|---|
| \(\theta_s\) | Ability | Subject | How good subject \(s\) is | All |
| \(\beta_n\) | Difficulty | Item | How hard item \(n\) is | All |
| \(\alpha_n\) | Discrimination | Item | How sharply item \(n\) separates strong from weak | 2PL, Beta |
| \(\phi\) | Dispersion | Global or item | How concentrated the Beta sits around its mean | Beta |

The noise model \(p(X_{s,n} \mid \text{latents})\) links the latents to the responses:

| Model | Response support | Noise model | Link to latents |
|---|---|---|---|
| 1PL | \(\{0, 1\}\) | Bernoulli | \(\sigma(\theta_s - \beta_n)\) |
| 2PL | \(\{0, 1\}\) | Bernoulli | \(\sigma(\alpha_n (\theta_s - \beta_n))\) |
| Beta | \((0, 1)\) | Beta | mean \(\mu_{s,n} = \sigma(\alpha_n \theta_s - \beta_n)\) (with \(\alpha_n = 1\) if discrimination is fixed), precision \(\phi\) |
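As a minimal sketch of the links in the table above, the three models can be written as plain Python functions. The latent values and the helper names (`p_1pl`, `p_2pl`, `beta_shapes`) are illustrative and not from the source. The conversion from (mean, precision) to Beta shape parameters assumes the common convention \(a = \mu\phi\), \(b = (1 - \mu)\phi\), and the Beta example fixes discrimination to 1.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def p_1pl(theta: float, beta: float) -> float:
    """1PL: the pass probability depends only on the ability-difficulty gap."""
    return sigmoid(theta - beta)


def p_2pl(theta: float, beta: float, alpha: float) -> float:
    """2PL: the gap is scaled by the item's discrimination before the sigmoid."""
    return sigmoid(alpha * (theta - beta))


def beta_shapes(mu: float, phi: float) -> tuple[float, float]:
    """Beta shape parameters (a, b) for mean mu and precision phi (assumed convention)."""
    return mu * phi, (1.0 - mu) * phi


# Hypothetical latents: 3 subjects, 2 items.
theta = [-1.0, 0.0, 1.5]
beta = [-0.5, 0.5]
alpha = [0.5, 2.0]

# S x N matrices of pass probabilities under 1PL and 2PL.
P1 = [[p_1pl(t, b) for b in beta] for t in theta]
P2 = [[p_2pl(t, b, a) for b, a in zip(beta, alpha)] for t in theta]

# Beta noise model: the gap sets the mean, phi sets the concentration.
a, b = beta_shapes(mu=p_1pl(theta[2], beta[1]), phi=8.0)
print(P1[2][1], P2[2][1], a / (a + b))
```

The Bernoulli models return a probability of passing, while the Beta model returns the parameters of a density over the score itself.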

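To fit the latents to the data, maximum likelihood is the obvious first try: pick the parameters that make the observed responses most probable. For IRT, however, the maximum-likelihood problem is ill-posed in two ways. Take 1PL as an example. First, a subject that passes every item pushes \(\hat\theta_s\) to infinity, because the likelihood keeps increasing as \(\theta_s\) grows. Second, shifting all \(\theta\) and \(\beta\) by the same constant leaves every gap \(\theta_s - \beta_n\), and therefore the likelihood, unchanged, so the location of the scale is not identified.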
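Both failure modes can be checked numerically. The sketch below uses a small made-up response matrix and made-up latents (none of the values come from the source) and evaluates the 1PL Bernoulli log-likelihood directly.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loglik_1pl(X, theta, beta):
    """Bernoulli log-likelihood of a binary response matrix under 1PL."""
    total = 0.0
    for s, row in enumerate(X):
        for n, x in enumerate(row):
            p = sigmoid(theta[s] - beta[n])
            total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total


# Hypothetical data: subject 0 passes every item.
X = [[1, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
theta = [1.0, -0.5, -0.5]
beta = [-0.3, 0.0, 0.8]

# Failure 1: shifting every theta and beta by the same constant changes nothing.
c = 7.0
base = loglik_1pl(X, theta, beta)
shifted = loglik_1pl(X, [t + c for t in theta], [b + c for b in beta])
print(abs(base - shifted))  # zero up to floating-point rounding

# Failure 2: for the perfect scorer, the log-likelihood keeps rising with theta_0,
# so no finite maximiser exists.
lls = [loglik_1pl(X, [t0, -0.5, -0.5], beta) for t0 in (1.0, 5.0, 10.0, 20.0)]
print(lls)
```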

The Bayesian approach fixes both issues by placing prior distributions on the latents. It treats the latents \((\theta, \beta, \alpha)\) as random variables and combines the likelihood with the priors through Bayes’ rule to obtain the posterior:

\[\underbrace{p(\theta, \beta, \alpha \mid X)}_{\text{posterior}} \;\propto\; \underbrace{p(X \mid \theta, \beta, \alpha)}_{\text{likelihood}} \cdot \underbrace{p(\theta)\, p(\beta)\, p(\alpha)}_{\text{prior}}.\]

The two right-hand factors pull in different directions. The likelihood pulls the latents toward whatever makes the observed scores most probable; the prior pulls them toward whatever was plausible before any data arrived, making the optimization a well-posed problem. Two types of priors show up in practice:

| Family | Idea | Example |
|---|---|---|
| Vague | Hyperparameters are fixed and deliberately wide; every latent stands on its own a priori | \(\theta_s \sim \mathcal{N}(0, 1)\), \(\beta_n \sim \mathcal{N}(0, 10^3)\) |
| Hierarchical | Hyperparameters are themselves random variables; latents in the same group pool through them | \(\theta_s \sim \mathcal{N}(\mu_\theta, 1/u_\theta)\), with \(\mu_\theta \sim \mathcal{N}(0, 10^6)\) and \(u_\theta \sim \mathrm{Gamma}(1, 1)\) |
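To illustrate how a prior makes the problem well-posed, the sketch below adds the vague priors from the table to a 1PL log-likelihood and evaluates the unnormalised log posterior. The data and latent values are made up, and the second argument of \(\mathcal{N}(\cdot, \cdot)\) is assumed to be a variance, in line with the \(1/u_\theta\) notation of the hierarchical row.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_normal_pdf(x: float, mean: float, var: float) -> float:
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_posterior_1pl(X, theta, beta):
    """Unnormalised log posterior: 1PL log-likelihood plus vague log priors."""
    lp = 0.0
    for s, row in enumerate(X):
        for n, x in enumerate(row):
            p = sigmoid(theta[s] - beta[n])
            lp += math.log(p) if x == 1 else math.log(1.0 - p)
    lp += sum(log_normal_pdf(t, 0.0, 1.0) for t in theta)   # theta_s ~ N(0, 1)
    lp += sum(log_normal_pdf(b, 0.0, 1e3) for b in beta)    # beta_n ~ N(0, 10^3)
    return lp


# Hypothetical data: subject 0 passes every item.
X = [[1, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
theta = [1.0, -0.5, -0.5]
beta = [-0.3, 0.0, 0.8]

# A common shift no longer goes unnoticed: the prior penalises it.
c = 7.0
base_lp = log_posterior_1pl(X, theta, beta)
shifted_lp = log_posterior_1pl(X, [t + c for t in theta], [b + c for b in beta])

# The perfect scorer now has a finite best ability on a grid from 0 to 10.
grid = [0.5 * i for i in range(21)]
best_t0 = max(grid, key=lambda t0: log_posterior_1pl(X, [t0, -0.5, -0.5], beta))
print(base_lp > shifted_lp, best_t0)
```

This only evaluates the posterior at chosen points; a real fit would sample from it or approximate it.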

Once we have a reasonable estimate of the posterior, the following quantities can be read off:

| Quantity | Definition | Use |
|---|---|---|
| Mean | \(\mathbb{E}[\theta_s \mid X]\) | The scalar that orders subjects by ability |
| Spread | Quantile range (e.g. 5th-95th percentile) or HDI (shortest interval covering a given mass, e.g. 90%) | Error bar on \(\hat\theta_s\) |
| Samples | \(T\) joint draws \(\{\theta^{(t)}, \beta^{(t)}, \alpha^{(t)}\}_{t=1}^T\) from the posterior (\(T \sim 10^3\) to \(10^4\)) | The samples give a distribution of rankings. Each draw orders subjects differently, so counting across draws answers questions like “how often is \(s\) in the top \(k\)?” or “how often does \(s\) beat \(s'\)?” |
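The quantities in the table can be computed from posterior draws with a few lines of code. In the sketch below the draws are a synthetic stand-in (independent Gaussians around made-up means), since no fitted posterior is available here; with a real fit, `draws` would hold the sampled \(\theta^{(t)}\) vectors. All function names are illustrative.

```python
import math
import random

random.seed(0)
T = 2000
post_mean = [1.2, 1.0, 0.1, -0.8]   # hypothetical abilities of 4 subjects
post_sd = [0.3, 0.3, 0.2, 0.4]      # hypothetical posterior spreads
# draws[t][s] = ability of subject s in draw t (synthetic stand-in).
draws = [[random.gauss(m, sd) for m, sd in zip(post_mean, post_sd)] for _ in range(T)]


def posterior_mean(draws, s):
    return sum(d[s] for d in draws) / len(draws)


def quantile_interval(draws, s, lo=0.05, hi=0.95):
    """Interval between two empirical quantiles of subject s."""
    vals = sorted(d[s] for d in draws)
    return vals[int(lo * (len(vals) - 1))], vals[int(hi * (len(vals) - 1))]


def hdi(draws, s, mass=0.9):
    """Shortest interval that contains a fraction `mass` of the draws."""
    vals = sorted(d[s] for d in draws)
    k = max(1, math.ceil(mass * len(vals)))  # number of draws inside the interval
    start = min(range(len(vals) - k + 1), key=lambda i: vals[i + k - 1] - vals[i])
    return vals[start], vals[start + k - 1]


def prob_top_k(draws, s, k):
    """Fraction of draws in which fewer than k subjects rank above s."""
    hits = sum(sum(1 for v in d if v > d[s]) < k for d in draws)
    return hits / len(draws)


def prob_beats(draws, s, s2):
    """Fraction of draws in which subject s has higher ability than s2."""
    return sum(d[s] > d[s2] for d in draws) / len(draws)


print(posterior_mean(draws, 0), quantile_interval(draws, 0), hdi(draws, 0))
print(prob_top_k(draws, 1, 1), prob_beats(draws, 0, 1))
```

Counting over joint draws, rather than comparing posterior means, is what turns overlapping error bars into a statement such as "subject 0 beats subject 1 in roughly two thirds of the draws".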

4.2. One-parameter logistic (1PL)

1PL is the original IRT model and the cleanest expression of the core idea: ability and difficulty live on the same axis, and the only thing that drives a response is the gap between them. Each subject gets a single ability \(\theta_s\), each item a single difficulty \(\beta_n\), and a subject is more likely than not to pass an item when its ability exceeds the item’s difficulty. With only two latents per cell, 1PL is also one of the most stable choices: fewer parameters mean less to identify.

The likelihood is Bernoulli with a logistic link:

\[P(X_{s,n} = 1 \mid \theta_s, \beta_n) = \sigma(\theta_s - \beta_n), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.\]

A positive gap \(\theta_s - \beta_n\) pushes the pass probability above \(0.5\); a negative gap pushes it below.

Because the likelihood is Bernoulli, 1PL only accepts binary responses directly. Continuous scores in \([0, 1]\), such as EPDMS, need to be binarised at a threshold \(\tau\) first, which adds a hyperparameter that has to be justified.
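A minimal sketch of the binarisation step, using made-up scores. Treating a score equal to \(\tau\) as a pass (`>=`) is an assumption; the source does not say how ties are handled.

```python
def binarise(scores, tau):
    """Map continuous scores in [0, 1] to {0, 1}: pass if score >= tau (assumed rule)."""
    return [[1 if x >= tau else 0 for x in row] for row in scores]


# Hypothetical continuous scores: 2 subjects, 3 items.
scores = [[0.92, 0.40, 0.75],
          [0.55, 0.10, 0.81]]

print(binarise(scores, 0.5))  # [[1, 0, 1], [1, 0, 1]]
print(binarise(scores, 0.8))  # [[1, 0, 0], [0, 0, 1]]
```

With \(\tau = 0.5\) the two subjects look identical, while \(\tau = 0.8\) separates them, which is why the choice of threshold needs a justification.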

| Component | 1PL |
|---|---|
| Vague priors | \(\theta_s \sim \mathcal{N}(0, 1)\); \(\beta_n \sim \mathcal{N}(0, 10^3)\) |
| Hierarchical priors | \(\mu_\theta, \mu_\beta \sim \mathcal{N}(0, 10^6)\); \(u_\theta, u_\beta \sim \mathrm{Gamma}(1, 1)\); \(\theta_s \sim \mathcal{N}(\mu_\theta, 1/u_\theta)\); \(\beta_n \sim \mathcal{N}(\mu_\beta, 1/u_\beta)\) |
| Point estimates (posterior means) | \(\hat\theta_s = \mu_{\theta,s}\); \(\hat\beta_n = \mu_{\beta,n}\) |
| Posterior uncertainty (posterior standard deviations) | \(\theta_s\): \(\sigma_{\theta,s}\); \(\beta_n\): \(\sigma_{\beta,n}\) |
| Use when | \(S\) in the tens: fewer parameters, more stable [3] |

4.3. Two-parameter logistic (2PL)

1PL gives every item the same slope. 2PL [1] adds a per-item discrimination \(\alpha_n\) that scales the ability-difficulty gap before the sigmoid:

\[P(X_{s,n} = 1 \mid \theta_s, \beta_n, \alpha_n) = \sigma\!\bigl(\alpha_n (\theta_s - \beta_n)\bigr).\]

Larger \(\alpha_n\) makes item \(n\) more decisive, with a steeper item characteristic curve (ICC); smaller \(\alpha_n\) makes it noisier. Discrimination must be positive (a strong subject should not pass less often than a weak one), so the model places a Normal prior on \(\log \alpha_n\), which is a LogNormal prior on \(\alpha_n\) itself.
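The effect of \(\alpha_n\) on steepness can be made concrete: differentiating \(\sigma(\alpha_n(\theta - \beta_n))\) with respect to \(\theta\) gives a slope of \(\alpha_n / 4\) at \(\theta = \beta_n\). The sketch below checks this with a finite difference; the difficulty and the discrimination values are made up.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def icc_2pl(theta: float, beta: float, alpha: float) -> float:
    """2PL item characteristic curve: pass probability as a function of ability."""
    return sigmoid(alpha * (theta - beta))


def icc_slope(theta: float, beta: float, alpha: float, h: float = 1e-5) -> float:
    """Central finite difference of the ICC with respect to theta."""
    return (icc_2pl(theta + h, beta, alpha) - icc_2pl(theta - h, beta, alpha)) / (2.0 * h)


beta = 0.5  # hypothetical difficulty
# Slope at theta = beta for increasing discrimination; each is close to alpha / 4.
slopes = {alpha: icc_slope(beta, beta, alpha) for alpha in (0.5, 1.0, 2.0, 4.0)}
for alpha, slope in slopes.items():
    print(alpha, slope, alpha / 4.0)
```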

| Component | 2PL |
|---|---|
| Vague priors | \(\theta_s \sim \mathcal{N}(0, 1)\); \(\beta_n \sim \mathcal{N}(0, 10^3)\); \(\log \alpha_n \sim \mathcal{N}(0, 1)\) |
| Hierarchical priors | \(\mu_\theta, \mu_\beta, \mu_{\log \alpha} \sim \mathcal{N}(0, 10^6)\); \(u_\theta, u_\beta, u_{\log \alpha} \sim \mathrm{Gamma}(1, 1)\); \(\theta_s \sim \mathcal{N}(\mu_\theta, 1/u_\theta)\); \(\beta_n \sim \mathcal{N}(\mu_\beta, 1/u_\beta)\); \(\log \alpha_n \sim \mathcal{N}(\mu_{\log \alpha}, 1/u_{\log \alpha})\) |
| Point estimates (posterior means; \(\alpha_n\) via the posterior mean of \(\log \alpha_n\)) | \(\hat\theta_s = \mu_{\theta,s}\); \(\hat\beta_n = \mu_{\beta,n}\); \(\hat \alpha_n = \exp(\mu_{\log \alpha, n})\) |
| Posterior uncertainty (posterior standard deviations) | \(\theta_s\): \(\sigma_{\theta,s}\); \(\beta_n\): \(\sigma_{\beta,n}\); \(\alpha_n\) (delta method): \(\hat \alpha_n \cdot \sigma_{\log \alpha, n}\) |
| Use when | items are heterogeneous in how cleanly they separate strong from weak subjects, and \(S\) is large enough to identify a per-item slope (typically \(S\) in the low hundreds) |
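The delta-method entry for \(\alpha_n\) can be sanity-checked numerically. The sketch below assumes the posterior of \(\log \alpha_n\) is Normal with a made-up mean and standard deviation, and compares the delta-method value \(\hat\alpha_n \cdot \sigma_{\log\alpha,n}\) with the exact LogNormal standard deviation and a Monte Carlo estimate.

```python
import math
import random


def delta_sd(mu_log: float, sd_log: float) -> float:
    """Delta-method standard deviation of alpha = exp(z), with z ~ N(mu_log, sd_log^2)."""
    return math.exp(mu_log) * sd_log


def exact_sd(mu_log: float, sd_log: float) -> float:
    """Exact standard deviation of a LogNormal(mu_log, sd_log^2) variable."""
    var = (math.exp(sd_log ** 2) - 1.0) * math.exp(2.0 * mu_log + sd_log ** 2)
    return math.sqrt(var)


# Hypothetical posterior of log(alpha_n): narrow, as the delta method requires.
mu_log_alpha, sd_log_alpha = 0.4, 0.1

random.seed(0)
draws = [math.exp(random.gauss(mu_log_alpha, sd_log_alpha)) for _ in range(100_000)]
mean = sum(draws) / len(draws)
mc_sd = math.sqrt(sum((a - mean) ** 2 for a in draws) / (len(draws) - 1))

print(delta_sd(mu_log_alpha, sd_log_alpha), exact_sd(mu_log_alpha, sd_log_alpha), mc_sd)

# With a wide posterior the linearisation breaks down and understates the spread.
wide_ratio = delta_sd(mu_log_alpha, 1.0) / exact_sd(mu_log_alpha, 1.0)
print(wide_ratio)
```

The approximation is a first-order linearisation, so it is only trustworthy when \(\sigma_{\log\alpha,n}\) is small; for wide posteriors, summarising the exponentiated draws directly is safer.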

4.4. Beta-IRT

Beta-IRT [2] is the smallest change to 1PL/2PL that lets the response stay continuous. The latent geometry is the same as in the logistic models: an ability-difficulty gap drives the expected score, and only the noise model changes. The Bernoulli on a binary outcome is replaced with a Beta on \(X_{s,n}\) itself, so partial credit lands somewhere in \((0, 1)\) instead of being squeezed through a threshold.

We follow the Molenaar parameterisation [4], which writes the Beta directly in its shape parameters and pushes the dispersion onto an item-level log-precision \(o_n\):

\[ X_{s,n} \sim \mathrm{Beta}(a_{s,n}, b_{s,n}), \qquad a_{s,n} = \exp\!\bigl(\tfrac{1}{2}(\eta_{s,n} + o_n)\bigr), \qquad b_{s,n} = \exp\!\bigl(\tfrac{1}{2}(-\eta_{s,n} + o_n)\bigr), \qquad \eta_{s,n} := \alpha_n \theta_s - \beta_n. \]

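The shape parameters look ad hoc, but they are chosen so that the Beta mean reduces to the 2PL ICC in slope-intercept form, with \(\alpha_n \theta_s - \beta_n\) in place of \(\alpha_n (\theta_s - \beta_n)\). In this form \(\beta_n\) acts as an intercept, equal to \(\alpha_n\) times the 2PL difficulty: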

\[ \mathbb{E}[X_{s,n}] = \frac{a_{s,n}}{a_{s,n} + b_{s,n}} = \sigma(\eta_{s,n}) = \sigma\!\bigl(\alpha_n \theta_s - \beta_n\bigr). \]

The log-precision \(o_n\) cancels out of the mean and only controls dispersion. The natural-scale precision \(o'_n = \exp(o_n / 2)\) plays the role of \(\phi\) in the (mean, precision) parameterisation, up to a factor that depends on the gap: \(a_{s,n} + b_{s,n} = 2\, o'_n \cosh(\eta_{s,n} / 2)\). Large \(o'_n\) concentrates the Beta around \(\sigma(\eta_{s,n})\); at \(\eta_{s,n} = 0\), where \(a_{s,n} = b_{s,n} = o'_n\), any \(o'_n < 1\) makes the density U-shaped. Keeping \(o_n\) per item is what lets a hard route widen its own noise without dragging a global precision down. Clamping \(\alpha_n = 1\) for every item collapses the model to a 1PL-style fit; otherwise \(\alpha_n\) is learned in log space alongside the rest of the latents.
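The properties of this parameterisation can be verified numerically. The sketch below builds the shape parameters from made-up latents and checks that the mean stays at \(\sigma(\eta_{s,n})\) while the variance shrinks as \(o_n\) grows. The function names are illustrative, and `beta_logpdf` is a plain standard-library implementation of the Beta log density rather than code from the source.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def shapes(theta: float, beta: float, alpha: float, o: float):
    """Beta shape parameters (a, b) from the latents and the log-precision o."""
    eta = alpha * theta - beta
    a = math.exp(0.5 * (eta + o))
    b = math.exp(0.5 * (-eta + o))
    return a, b


def beta_logpdf(x: float, a: float, b: float) -> float:
    """Log density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


# Hypothetical latents for one (subject, item) cell.
theta, beta, alpha = 0.8, 0.2, 1.5
eta = alpha * theta - beta

rows = []
for o in (-2.0, 0.0, 4.0):
    a, b = shapes(theta, beta, alpha, o)
    mean = a / (a + b)                                # independent of o
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))      # shrinks as o grows
    rows.append((o, a + b, mean, var, beta_logpdf(0.7, a, b)))

for row in rows:
    print(row)
print(sigmoid(eta))
```

The last column shows how the log-likelihood of a single observed score (here 0.7) changes with \(o_n\) while the mean is held fixed.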

| Component | Beta-IRT (Molenaar) |
|---|---|
| Vague priors | \(\theta_s \sim \mathcal{N}(0, 1)\); \(\beta_n \sim \mathcal{N}(0, 10^3)\); \(\log \alpha_n \sim \mathcal{N}(0, 1)\) (omit if discrimination is fixed); \(o_n \sim \mathcal{N}(0, 1)\) |
| Hierarchical priors | \(\mu_\theta, \mu_\beta \sim \mathcal{N}(0, 10^6)\); \(u_\theta, u_\beta \sim \mathrm{Gamma}(1, 1)\); \(\theta_s \sim \mathcal{N}(\mu_\theta, 1/u_\theta)\); \(\beta_n \sim \mathcal{N}(\mu_\beta, 1/u_\beta)\); \(\log \alpha_n, o_n \sim \mathcal{N}(0, 1)\) (no hyperprior on discrimination or precision) |
| Point estimates (posterior means; \(\alpha_n\) and \(o'_n\) via the posterior means of \(\log \alpha_n\) and \(o_n\)) | \(\hat\theta_s = \mu_{\theta,s}\); \(\hat\beta_n = \mu_{\beta,n}\); \(\hat\alpha_n = \exp(\mu_{\log\alpha,n})\) (or \(1\) if discrimination is fixed); \(\hat{o'}_n = \exp(\mu_{o,n} / 2)\) |
| Posterior uncertainty (posterior standard deviations) | \(\theta_s\): \(\sigma_{\theta,s}\); \(\beta_n\): \(\sigma_{\beta,n}\); \(\alpha_n\) (delta method): \(\hat\alpha_n \cdot \sigma_{\log\alpha,n}\); \(o'_n\) (delta method): \(\tfrac{1}{2}\hat{o'}_n \cdot \sigma_{o,n}\) |
| Use when | continuous responses in \((0, 1)\) where 2PL-style ordering captures the signal; pair with the zero/one inflation in Section 10 when scores spike at the boundaries [4, 5] |

4.5. References

[1] A. Birnbaum. Some latent trait models. In F. M. Lord and M. R. Novick, editors, Statistical Theories of Mental Test Scores. Addison-Wesley, 1968.

[2] Y. Noel and B. Dauvier. A beta item response model for continuous bounded responses. Applied Psychological Measurement, 31(1):47–73, 2007.

[3] U. Schroeders and T. Gnambs. Sample-size planning in item-response theory: a tutorial. Advances in Methods and Practices in Psychological Science, 2025.

[4] D. Molenaar, M. Cúri, and J. L. Bazán. Zero and one inflated item response theory models for bounded continuous data. Journal of Educational and Behavioral Statistics, 47(6):693–735, 2022.

[5] C. Zopluoglu and J. R. Lockwood. A comparative study of item response theory models for mixed discrete-continuous responses. Journal of Intelligence, 12(3):26, 2024.