7. IRT: Goodness of Fit

Validating a ranking model is hard. An image classifier can be scored on a held-out test set, but a ranking model has no such anchor: there is no external true ordering to compare against. This section lays out six diagnostics that work around that gap without relying on human preference.

7.1. Overview

A fitted IRT model returns ability and difficulty estimates, but it says nothing about whether those estimates explain the data. We have six diagnostics, each filling a gap that the previous one leaves:

| Diagnostic | Asks | Output | Limit |
| --- | --- | --- | --- |
| Synthetic data recovery | Does the estimator recover ground truth and reproduce the response matrix? | Scatter, RMSE, Spearman, coverage, residual / PIT plots | Validates the algorithm; real-data validity needs the rest |
| Confident-violation count \(g_{s,n}\) | Does the posterior ordering agree with the responses? | Per-subject and per-item sums | Disjoint-CI gate is conservative |
| Classification metrics | Does the binarised mean recover the binary outcome matrix? | Accuracy, precision, recall, F1 | Mixes label noise and systematic misfit |
| Continuous violation magnitude \(\tilde g_{s,n}\) | Same as \(g\), graded for continuous \(X \in [0, 1]\) | Per-subject and per-item sums | Loses the sign of the over- vs under-shoot |
| Unidimensionality check | Is a single latent skill enough? | Residual eigenvalue spectrum, parallel analysis envelope | Counts missed skills, not which they are; see Section 11 |
| Route network connectivity | Can the response matrix be jointly scaled at all? | Number of connected components, minimum edge weight | Pre-fit only; trivial when no entries are missing |

Run synthetic recovery first: it is the only setting in which the estimator is checked against ground truth.

7.2. Synthetic data recovery

Generate from a known \((\theta^\star, \beta^\star, \alpha^\star)\), draw \(X\), refit, and ask whether the truth comes back. A scatter of \(\hat\theta\) against \(\theta^\star\) shows recovery directly; rank-monotonicity is the headline, summarised by Spearman’s \(\rho\).
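The recovery loop can be sketched in a few lines. The example below is a minimal illustration, not the estimator used in this document: it assumes a 1PL model (so \(\alpha^\star\) is fixed at 1), and `fit_1pl` is a hypothetical ridge-penalised joint maximum likelihood fit by gradient ascent that stands in for whatever estimator is actually being validated. The sample sizes, seed and penalty are arbitrary choices.

```python
import numpy as np
from scipy.stats import spearmanr


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_1pl(theta, beta, rng):
    """Draw a binary response matrix X from known 1PL parameters."""
    p = sigmoid(theta[:, None] - beta[None, :])
    return (rng.random(p.shape) < p).astype(float)


def fit_1pl(X, n_iter=1000, lr=1.0, ridge=0.01):
    """Hypothetical stand-in estimator: ridge-penalised joint ML by gradient ascent."""
    theta = np.zeros(X.shape[0])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        resid = X - sigmoid(theta[:, None] - beta[None, :])
        # The log-likelihood gradient is +sum(resid) for theta and -sum(resid) for beta;
        # means are used instead of sums so that one step size suits both blocks.
        theta += lr * (resid.mean(axis=1) - ridge * theta)
        beta -= lr * (resid.mean(axis=0) + ridge * beta)
    return theta, beta


rng = np.random.default_rng(0)
theta_star = rng.normal(size=300)   # known abilities
beta_star = rng.normal(size=60)     # known difficulties
X = simulate_1pl(theta_star, beta_star, rng)
theta_hat, beta_hat = fit_1pl(X)

# Rank recovery is the headline; RMSE is computed after centring because
# the 1PL location is only identified up to a shift.
rho_theta, _ = spearmanr(theta_hat, theta_star)
rho_beta, _ = spearmanr(beta_hat, beta_star)
centred_error = (theta_hat - theta_hat.mean()) - (theta_star - theta_star.mean())
rmse_theta = np.sqrt(np.mean(centred_error ** 2))
print(f"Spearman theta={rho_theta:.3f}, beta={rho_beta:.3f}, RMSE theta={rmse_theta:.3f}")
```

Swapping `fit_1pl` for the real estimator leaves the rest of the loop unchanged, which is the point of the check: the comparison against \(\theta^\star\) does not depend on how the fit was obtained.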

Three further predictive checks on the same \(X\) catch likelihood-level failures that the recovery scatter misses:

| Plot | Catches | Failure shape |
| --- | --- | --- |
| Scatter of \(\hat p_{s,n}\) vs \(X_{s,n}\) | Mean-fit accuracy | Off-diagonal mass |
| Residual hexbin \(X - \hat p\) vs \(\hat p\) | Heteroscedasticity (wrong dispersion model) | Trumpet fan; suggests a per-item \(\phi_n\) |
| PIT histogram (continuous) or reliability diagram (binary) | Wrong distribution shape | U (too narrow), inverted-U (too wide), left spike (zero-inflation unmodelled) |
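As an illustration of the PIT row, the sketch below computes PIT values for a continuous response and shows the U shape that appears when the predictive distribution is too narrow. It assumes a Beta likelihood in the mean-precision parametrisation, \(\text{Beta}(\mu\phi, (1-\mu)\phi)\); that parametrisation, the helper names and the simulated means are assumptions made for the example, not taken from the fitted model.

```python
import numpy as np
from scipy.stats import beta as beta_dist


def pit_values(x, mu, phi):
    """PIT u = F(x) under a Beta with mean mu and precision phi (assumed parametrisation)."""
    return beta_dist.cdf(x, mu * phi, (1.0 - mu) * phi)


def pit_histogram(u, bins=10):
    """Fraction of PIT values per equal-width bin on [0, 1]."""
    counts, _ = np.histogram(u, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()


rng = np.random.default_rng(1)
mu = rng.uniform(0.2, 0.8, size=20000)   # stand-in for the fitted means
phi_true = 5.0
x = rng.beta(mu * phi_true, (1.0 - mu) * phi_true)

# Correct precision: the histogram is close to flat.
hist_ok = pit_histogram(pit_values(x, mu, phi_true))
# Precision set far too high (predictive distribution too narrow): mass piles up at 0 and 1.
hist_narrow = pit_histogram(pit_values(x, mu, 50.0))

print(np.round(hist_ok, 3))
print(np.round(hist_narrow, 3))
```

A flat histogram means the observed responses look like draws from the predictive distribution; the U shape means too many observations land in the predictive tails.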

7.3. Fitness to a binary response matrix

A cell is a confident violation when the CIs on \(\theta_s\) and \(\beta_n\) do not overlap and the outcome contradicts the implied ordering:

\[g_{s,n} = \mathbb{1}\!\left[q_{97.5}(\theta_s) < q_{2.5}(\beta_n)\right] \cdot \mathbb{1}\!\left[X_{s,n} > 0.5\right] + \mathbb{1}\!\left[q_{2.5}(\theta_s) > q_{97.5}(\beta_n)\right] \cdot \mathbb{1}\!\left[X_{s,n} < 0.5\right].\]

The first term flags a driving policy confidently below the route difficulty that still passes; the second flags the mirror case. The disjoint-CI gate is what makes \(g\) stricter than plain accuracy: it only triggers when the model is sure about the ordering, so a hit is a real contradiction rather than ambiguity.

| Readout | Formula | Reads as |
| --- | --- | --- |
| Global misfit | \(\frac{1}{SN}\sum_{s,n} g_{s,n}\) | Single goodness-of-fit number; lower is better |
| Per-subject | \(\sum_n g_{s,n}\) | Aberrant driving policies |
| Per-item | \(\sum_s g_{s,n}\) | Routes systematically misfitting |
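The definition of \(g_{s,n}\) and its three readouts translate directly into array operations. The sketch below assumes posterior draws are available as arrays of shape (draws, subjects) and (draws, items); the function name, argument names and toy numbers are illustrative.

```python
import numpy as np


def confident_violations(theta_draws, beta_draws, X, level=0.95):
    """g[s, n] = 1 when the CIs on theta_s and beta_n are disjoint and X[s, n] contradicts the ordering."""
    lo = (1.0 - level) / 2.0
    hi = 1.0 - lo
    th_lo, th_hi = np.quantile(theta_draws, [lo, hi], axis=0)
    be_lo, be_hi = np.quantile(beta_draws, [lo, hi], axis=0)
    below = th_hi[:, None] < be_lo[None, :]   # policy confidently below route difficulty
    above = th_lo[:, None] > be_hi[None, :]   # policy confidently above route difficulty
    g = (below & (X > 0.5)) | (above & (X < 0.5))
    return g.astype(int)


# Toy posterior: tight draws around known centres, so every CI pair is disjoint.
offsets = np.linspace(-0.1, 0.1, 21)[:, None]
theta_draws = np.array([-2.0, 0.0, 2.0]) + offsets   # shape (21, 3)
beta_draws = np.array([-1.0, 1.0]) + offsets         # shape (21, 2)
X = np.array([[1, 0],
              [1, 0],
              [0, 1]])

g = confident_violations(theta_draws, beta_draws, X)
global_misfit = g.mean()        # (1 / SN) * sum over all cells
per_subject = g.sum(axis=1)     # aberrant driving policies
per_item = g.sum(axis=0)        # routes systematically misfitting
print(g, global_misfit, per_subject, per_item)
```

In the toy matrix the weakest policy passes the easy route and the strongest policy fails it, so both cells are flagged and the first route collects both violations.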

\(g_{s,n}\) keeps only cells with confidently-disjoint CIs and discards the rest. Under 1PL, every cell carries a binary prediction \(\hat X_{s,n} = \mathbb{1}[\hat p_{s,n} > 0.5]\) with \(\hat p_{s,n} = \sigma(\hat\theta_s - \hat\beta_n)\), so the matrix recovery can be scored as a classification problem against the observed binary outcomes. Treating \(X_{s,n} = 1\) as the positive class:

| Quantity | Formula |
| --- | --- |
| Accuracy | \(\frac{1}{SN}\sum_{s,n} \mathbb{1}[\hat X_{s,n} = X_{s,n}]\) |
| Precision | \(\sum_{s,n} \hat X_{s,n}\, X_{s,n} \,\big/\, \sum_{s,n} \hat X_{s,n}\) |
| Recall | \(\sum_{s,n} \hat X_{s,n}\, X_{s,n} \,\big/\, \sum_{s,n} X_{s,n}\) |
| F1 | \(2 \cdot \text{Precision} \cdot \text{Recall} \,/\, (\text{Precision} + \text{Recall})\) |

Per-subject and per-item versions restrict the sums to a single row or column. A driving policy with low row-accuracy is aberrant; a route with low column-accuracy is misfitting. These flag the same contradiction that \(g\) does (the model said one thing, the outcome said the other), but without the disjoint-CI gate, so they pick up label noise alongside systematic misfit. Use them alongside \(g\), not in place of it: \(g\) isolates confident violations, while the classification metrics summarise how well the binarised mean matches the matrix overall.
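A small sketch of these metrics under 1PL follows, using point estimates \(\hat\theta\) and \(\hat\beta\). The function name and the returned dictionary layout are illustrative choices, and undefined ratios are returned as NaN by assumption.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def classification_fit(theta_hat, beta_hat, X):
    """Score the binarised 1PL mean against the observed binary matrix X."""
    p_hat = sigmoid(theta_hat[:, None] - beta_hat[None, :])
    X_hat = (p_hat > 0.5).astype(int)
    match = (X_hat == X)
    tp = np.sum(X_hat * X)                       # predicted pass and observed pass
    precision = tp / X_hat.sum() if X_hat.sum() > 0 else float("nan")
    recall = tp / X.sum() if X.sum() > 0 else float("nan")
    f1 = 2 * precision * recall / (precision + recall) if tp > 0 else float("nan")
    return {
        "accuracy": match.mean(),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "row_accuracy": match.mean(axis=1),      # per driving policy
        "col_accuracy": match.mean(axis=0),      # per route
    }


theta_hat = np.array([-1.0, 0.5, 2.0])
beta_hat = np.array([0.0, 1.0])
X = np.array([[0, 0],
              [1, 1],
              [1, 0]])
scores = classification_fit(theta_hat, beta_hat, X)
print(scores)
```

Here the second route has a column accuracy of 1/3, which marks it as the misfitting one even though no credible intervals were consulted.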

7.4. Fitness to a continuous response matrix

For Beta-IRT, \(X_{s,n} \in [0, 1]\) and the indicator \(\mathbb{1}[X_{s,n} > 0.5]\) in \(g\) collapses most of the signal into one bit. Replace each indicator with the gap between the observation and the model’s expected response \(\hat p_{s,n} = \sigma(\hat\alpha_n \hat\theta_s - \hat\beta_n)\):

\[\begin{split} \begin{aligned} \tilde g_{s,n} = \;& \mathbb{1}\!\left[q_{97.5}(\theta_s) < q_{2.5}(\beta_n)\right] \cdot \max(X_{s,n} - \hat p_{s,n}, 0) \\ + & \mathbb{1}\!\left[q_{2.5}(\theta_s) > q_{97.5}(\beta_n)\right] \cdot \max(\hat p_{s,n} - X_{s,n}, 0). \end{aligned} \end{split}\]

A confident-ordering cell is now contradicted in proportion to how much the score over- or undershoots the prediction, instead of by a single bit. Per-subject and per-item sums carry the same interpretation as for \(g\): aberrant driving policies and misfitting routes, now graded rather than counted.
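The graded version can be sketched the same way. The example below takes the credible-interval bounds as inputs instead of recomputing them from draws; the function name, the interval half-width of 0.1 and the toy values are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def violation_magnitude(th_lo, th_hi, be_lo, be_hi, p_hat, X):
    """Continuous g-tilde: gated over- or undershoot of X relative to p_hat."""
    below = th_hi[:, None] < be_lo[None, :]      # policy confidently below the route
    above = th_lo[:, None] > be_hi[None, :]      # policy confidently above the route
    overshoot = np.maximum(X - p_hat, 0.0)       # scored higher than expected
    undershoot = np.maximum(p_hat - X, 0.0)      # scored lower than expected
    return below * overshoot + above * undershoot


theta_hat = np.array([-2.0, 2.0])
beta_hat = np.array([0.0, 2.05])
alpha_hat = np.array([1.0, 1.0])
half_width = 0.1                                 # toy CI half-width

p_hat = sigmoid(alpha_hat[None, :] * theta_hat[:, None] - beta_hat[None, :])
X = np.array([[0.5, 0.5],
              [0.5, 0.1]])

g_tilde = violation_magnitude(
    theta_hat - half_width, theta_hat + half_width,
    beta_hat - half_width, beta_hat + half_width,
    p_hat, X,
)
per_subject = g_tilde.sum(axis=1)
per_item = g_tilde.sum(axis=0)
print(np.round(g_tilde, 4))
```

The last cell shows the gate at work: the second policy scores far below its prediction on the second route, but the two intervals overlap, so the cell contributes nothing.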

7.5. Unidimensionality check

The fit diagnostics above presume that the routes load on a single latent skill. If the routes split into clusters that exercise different skills, a unidimensional fit collapses each driving policy’s profile into one \(\hat\theta_s\), and the misfit shows up as correlated residuals across items within a cluster rather than as a clean failure of \(g\). Two checks address this. Residual correlation eigenvalues: after a unidimensional fit, form the item-by-item correlation matrix of the cell residuals and read its eigenvalue spectrum. Parallel analysis: set a noise floor by computing the same eigenvalues on a copy of \(X\) whose entries are permuted within each column, which keeps each route's marginal distribution but destroys the dependence between routes. Each eigenvalue above the noise floor counts one missed skill.
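A compact sketch of the two checks is given below. It is a simplification under stated assumptions: the residual matrix `R` is simulated instead of coming from a real fit, and the noise floor is obtained by shuffling the residuals within each column instead of refitting the model on permuted \(X\). The function names, the 95% quantile and the number of permutations are illustrative.

```python
import numpy as np


def residual_eigenvalues(R):
    """Eigenvalues of the item-by-item residual correlation matrix, largest first."""
    corr = np.corrcoef(R, rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]


def parallel_analysis(R, n_perm=200, quantile=0.95, seed=0):
    """Compare observed eigenvalues with a permutation noise floor."""
    rng = np.random.default_rng(seed)
    observed = residual_eigenvalues(R)
    null = np.empty((n_perm, R.shape[1]))
    for b in range(n_perm):
        # axis=0 shuffles every column independently, breaking between-item dependence.
        null[b] = residual_eigenvalues(rng.permuted(R, axis=0))
    floor = np.quantile(null, quantile, axis=0)
    above = observed > floor
    # Count leading eigenvalues above the floor (stop at the first one below it).
    n_extra = above.size if above.all() else int(np.argmin(above))
    return observed, floor, n_extra


# Simulated residuals: two route clusters pulled in opposite directions by one
# unmodelled skill, plus independent noise.
rng = np.random.default_rng(2)
S, N = 500, 12
loading = np.where(np.arange(N) < N // 2, 0.8, -0.8)
factor = rng.normal(size=S)
R = factor[:, None] * loading[None, :] + rng.normal(size=(S, N))

observed, floor, n_extra = parallel_analysis(R)
print(np.round(observed[:3], 2), np.round(floor[:3], 2), n_extra)
```

The count says how many extra dimensions the residuals carry, not what they are; identifying the skills takes the loading pattern of the corresponding eigenvectors.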

Section 11 develops both diagnostics in full and covers the MIRT extensions that take over once the unidim assumption fails.

7.6. Route network connectivity

All diagnostics above, including the unidimensionality check, assume the response matrix can be scaled jointly in the first place. With missing entries that assumption can fail: if two clusters of routes share no driving policy, IRT scales each cluster on its own latent space, and estimates from one cluster are not comparable to estimates from the other. The network analysis of [1] checks for this before fitting.

Construct a graph \(G = (V, E)\) with routes as nodes. Two routes are connected by an edge if at least one driving policy has a response on both. The edge weight equals the number of such linking policies. Two readouts:

| Quantity | Reads as |
| --- | --- |
| Number of connected components | More than one means the matrix is unscalable. Each component must be fit separately, and the resulting abilities cannot be placed on a common scale. |
| Minimum edge weight | A bottleneck route pair linked by only a few policies. Even when the graph is connected, a thin link inflates the uncertainty of estimates that depend on it. |

The dual graph (driving policies as nodes, edges between policies sharing a route) carries the same content from the other side, and is the version to inspect when missingness is policy-driven rather than route-driven.
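The graph construction and both readouts fit in a short sketch. It assumes the missingness pattern is available as a boolean matrix `mask` with policies in rows and routes in columns; the function names are illustrative, and the component search is a plain depth-first traversal, not the implementation of any particular library or of [1].

```python
import numpy as np


def route_graph(mask):
    """Weighted adjacency: W[i, j] = number of policies with a response on both routes."""
    M = np.asarray(mask).astype(int)
    W = M.T @ M
    np.fill_diagonal(W, 0)          # no self-loops
    return W


def connected_components(W):
    """Label connected components by depth-first search over positive-weight edges."""
    n = W.shape[0]
    labels = -np.ones(n, dtype=int)
    n_comp = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = n_comp
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(W[i] > 0):
                if labels[j] < 0:
                    labels[j] = n_comp
                    stack.append(j)
        n_comp += 1
    return n_comp, labels


def min_edge_weight(W):
    """Smallest positive edge weight, or 0 when the graph has no edges."""
    weights = W[W > 0]
    return int(weights.min()) if weights.size else 0


# Three policies, four routes: the third policy only drove routes 2 and 3.
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 1]], dtype=bool)

W = route_graph(mask)
n_comp, labels = connected_components(W)
thinnest = min_edge_weight(W)
# The dual graph (policies as nodes) comes from the transposed mask.
n_comp_dual, _ = connected_components(route_graph(mask.T))
print(n_comp, labels, thinnest, n_comp_dual)
```

Two components here mean routes 0 and 1 cannot be placed on the same scale as routes 2 and 3, and the dual graph reports the same split from the policy side.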

For NavHard with no missing entries the graph is complete and the analysis is a sanity check. The diagnostic is informative once routes are sub-sampled per policy (Section 2) or when policies skip routes for cost or availability reasons.

7.7. References

[1] C. Zopluoglu. Zero-and-one inflated IRT models for bounded continuous response data: a tutorial. 2024. URL: https://czopluoglu.github.io/Duolingo_paper/.