6. IRT: Uncertainty Quantification
Posterior estimation comes with uncertainty for free: the variational \(\sigma\) of the SVI guides, or the percentile intervals of MCMC draws. This section covers further recipes that are orthogonal or complementary to the posterior uncertainty: (1) non-parametric and parametric bootstraps target the data-side question (what if the data had been different); (2) seed variance targets the training side (how robust are our estimator and implementation).
6.1. Overview
| Recipe | Fitter | What it uses | Cost | Assumption | Notes |
|---|---|---|---|---|---|
| Posterior percentile / HDI | MCMC | MCMC draws | Expensive MCMC sampling | Exact in the sample limit | Gold standard |
| Variational per-factor \(\sigma\) | SVI | Guide parameters | Free, from the fit | Factors are independent | Baseline; can under-report spread |
| Non-parametric bootstrap | Any | Resample items or subjects | \(B \times 1\) fits | Rows / columns are i.i.d. | Model-agnostic; cheap when wrapped around an SVI fit |
| Parametric bootstrap | Any | Simulate from fitted model | \(B \times 1\) fits | Model is the truth | Sensitive to misspecification; usually dominated by non-parametric |
| Seed variance | Any | Refits over many seeds, aggregated after linking | \(N_\text{seed}\) fits | Seeds proxy training-side noise | Orthogonal axis: captures optimiser noise |
6.2. Posterior credible intervals (MCMC)
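Given \(T\) MCMC draws \(\{\theta^{(t)}\}_{t=1}^T\), every subject’s posterior is already available as a sample cloud, so the credible interval is a quantile read off from the draws:
\[
\mathrm{CI}_{95\%}(\theta_s) = \big[\, Q_{0.025}\big(\{\theta_s^{(t)}\}_{t=1}^T\big),\; Q_{0.975}\big(\{\theta_s^{(t)}\}_{t=1}^T\big) \,\big],
\]
where \(Q_p\) denotes the empirical \(p\)-quantile.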
The percentile CI puts \(2.5\%\) in each tail and is invariant to monotone reparameterisation. The HDI is the shortest interval covering the same mass and tracks the asymmetry of skewed posteriors (e.g. \(\alpha_n\) under LogNormal). Report the HDI when the posterior is asymmetric; report the percentile CI when reparameterisation-invariance matters.
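The difference between the two intervals is easiest to see on a skewed sample. The following is a minimal NumPy sketch, not the library's own routine: the helper names `percentile_ci` and `hdi` are hypothetical, and the LogNormal draws are synthetic stand-ins for real MCMC output.

```python
import numpy as np

def percentile_ci(draws, mass=0.95):
    """Equal-tailed interval: (1 - mass) / 2 of the draws in each tail."""
    tail = (1.0 - mass) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])
    return float(lo), float(hi)

def hdi(draws, mass=0.95):
    """Shortest window that contains a `mass` fraction of the sorted draws."""
    x = np.sort(np.asarray(draws, dtype=float))
    n = x.size
    k = int(np.ceil(mass * n))              # number of draws inside the window
    widths = x[k - 1:] - x[: n - k + 1]     # width of every window of k consecutive draws
    i = int(np.argmin(widths))
    return float(x[i]), float(x[i + k - 1])

# Synthetic, right-skewed "posterior draws" (assumption: stand-in for alpha_n draws).
rng = np.random.default_rng(0)
alpha_draws = rng.lognormal(mean=0.0, sigma=0.5, size=4000)

ci = percentile_ci(alpha_draws)
hd = hdi(alpha_draws)
print("percentile CI:", ci)
print("HDI:          ", hd)
```

On a right-skewed sample like this one, the HDI is no wider than the percentile CI and sits closer to the mode, which is the asymmetry-tracking behaviour described above.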
6.3. Variational per-factor \(\sigma\)
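Mean-field SVI fits a guide \(q_\phi = \prod_k \mathcal{N}(\mu_k, \sigma_k)\) (or LogNormal / logit-Normal factors as the support requires). Each factor is already parameterised by a mean \(\mu_k\) and standard deviation \(\sigma_k\), so the credible interval on \(\hat\theta_s\) falls out as the Gaussian one centred at the fitted mean:
\[
\mathrm{CI}_{95\%}(\theta_s) = \big[\, \mu_s - 1.96\,\sigma_s,\; \mu_s + 1.96\,\sigma_s \,\big],
\]
where \(\mu_s\) and \(\sigma_s\) are the guide parameters of the factor for \(\theta_s\).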
The catch is that mean-field ignores posterior correlations and is known to under-report the spread when those correlations are non-trivial [2].
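As a minimal sketch of the Gaussian interval, assuming the guide's per-subject means and standard deviations have already been extracted into arrays. The names `gaussian_ci` and `lognormal_ci` and all numeric values below are hypothetical. For a LogNormal factor, the interval is formed on the log scale and the endpoints are exponentiated.

```python
from statistics import NormalDist

import numpy as np

def gaussian_ci(mu, sigma, mass=0.95):
    """Central Gaussian interval mu +/- z * sigma for each factor."""
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)   # about 1.96 when mass = 0.95
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return mu - z * sigma, mu + z * sigma

def lognormal_ci(mu_log, sigma_log, mass=0.95):
    """Interval for a LogNormal factor: Gaussian on the log scale, then exp."""
    lo, hi = gaussian_ci(mu_log, sigma_log, mass)
    return np.exp(lo), np.exp(hi)

# Hypothetical guide parameters for three subjects (made-up numbers).
mu = np.array([-0.8, 0.1, 1.2])
sigma = np.array([0.20, 0.15, 0.40])

theta_lo, theta_hi = gaussian_ci(mu, sigma)
alpha_lo, alpha_hi = lognormal_ci(np.array([0.0, 0.3]), np.array([0.25, 0.10]))
print(np.column_stack([theta_lo, theta_hi]))
print(np.column_stack([alpha_lo, alpha_hi]))
```

Because each factor is treated independently, these intervals inherit the mean-field caveat: they say nothing about joint uncertainty across subjects.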
6.4. Non-parametric bootstrap
Resample rows or columns of \(X\) with replacement, refit the model on the resampled data, and collect the replicated estimates. The recipe is model-agnostic: it does not rely on the likelihood form, only on the i.i.d. assumption on the resample axis.
Algorithm (item bootstrap). For \(b = 1, \dots, B\):
1. Draw \(N\) indices \(i_1, \dots, i_N\) i.i.d. uniform from \(\{1, \dots, N\}\), i.e. with replacement.
2. Refit on the column-resampled matrix \(X[:, i_{1:N}]\) to obtain \(\hat\theta^{(b)}\).
Resample items (columns) to answer “what if the benchmark had drawn a different set of routes?”. Resample subjects (rows) to answer “what if the model pool had been different?”. LMSYS Chatbot Arena uses an item bootstrap with \(B = 100\) for its LLM leaderboard intervals [1]; \(500\) or more is preferred for paper-grade intervals.
The cost is \(B\) full refits, which is affordable when each SVI fit takes seconds. The interval becomes noisy when \(N\) is small, and every draw contains duplicated items, so it carries less information than the original item set; both push toward a larger \(B\) for stability rather than richer inference.
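The item bootstrap can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the project's implementation: `fit_ability` is a hypothetical stand-in for the real IRT fit (it returns a smoothed logit of each subject's mean score so the example runs quickly), `item_bootstrap` is a hypothetical helper, and the response matrix is simulated.

```python
import numpy as np

def fit_ability(X):
    """Hypothetical stand-in for the real fit: smoothed logit of each subject's mean score."""
    p = (X.sum(axis=1) + 0.5) / (X.shape[1] + 1.0)
    return np.log(p / (1.0 - p))

def item_bootstrap(X, fit, B=500, seed=0):
    """Resample columns (items) with replacement and refit B times."""
    rng = np.random.default_rng(seed)
    n_subjects, n_items = X.shape
    reps = np.empty((B, n_subjects))
    for b in range(B):
        idx = rng.integers(0, n_items, size=n_items)  # N item indices, with replacement
        reps[b] = fit(X[:, idx])                      # refit on the resampled columns
    return reps

# Simulated subjects-by-items response matrix (assumption: binary responses).
rng = np.random.default_rng(1)
theta_true = np.linspace(-1.5, 1.5, 6)
beta_true = rng.normal(size=200)
P = 1.0 / (1.0 + np.exp(-(theta_true[:, None] - beta_true[None, :])))
X = (rng.random(P.shape) < P).astype(int)

reps = item_bootstrap(X, fit_ability, B=500)
lo, hi = np.quantile(reps, [0.025, 0.975], axis=0)    # per-subject percentile CI
print(np.column_stack([lo, fit_ability(X), hi]))
```

Swapping `X[:, idx]` for `X[idx, :]` (with indices drawn over rows) would give the subject bootstrap instead; in that case the replicates refer to resampled subjects, so the quantity being summarised changes accordingly.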
6.5. Parametric bootstrap
Resimulate responses from the fitted model, refit on the simulated data, and collect the replicated estimates. The fitted parameters \((\hat\theta, \hat\beta, \hat\alpha)\) play the role of the truth.
Algorithm. For \(b = 1, \dots, B\):
1. Simulate \(X^{(b)}_{s,n} \sim p(\cdot \mid \hat\theta_s, \hat\beta_n, \hat\alpha_n)\) for every \((s, n)\).
2. Refit the model on \(X^{(b)}\) to obtain a new estimate \(\hat\theta^{(b)}\).
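The replicate set \(\{\hat\theta^{(b)}\}_{b=1}^B\) yields per-subject percentile CIs:
\[
\mathrm{CI}_{95\%}(\theta_s) = \big[\, Q_{0.025}\big(\{\hat\theta_s^{(b)}\}_{b=1}^B\big),\; Q_{0.975}\big(\{\hat\theta_s^{(b)}\}_{b=1}^B\big) \,\big],
\]
where \(Q_p\) is the empirical \(p\)-quantile over the replicates.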
The parametric bootstrap is useful when the data are too sparse for column resampling to be reliable, or when we trust the likelihood and want to propagate its noise model into the interval. The cost is the same \(B\) refits as the non-parametric version, plus a simulator. The non-parametric version dominates when we do not want to bet on the likelihood being correctly specified, since the parametric bootstrap inherits any model misspecification as bias in its intervals.
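A minimal sketch of the simulate-and-refit loop follows. It assumes a 2PL Bernoulli likelihood with success probability \(\mathrm{sigmoid}(\alpha_n(\theta_s - \beta_n))\), which is an assumption made for the example rather than something fixed by this section. The helpers `simulate_2pl`, `fit_ability`, and `parametric_bootstrap` are hypothetical, the "fitted" parameters are made-up numbers, and `fit_ability` is a crude stand-in for the real refit.

```python
import numpy as np

def simulate_2pl(theta, beta, alpha, rng):
    """Draw a binary subjects-by-items matrix from an assumed 2PL model."""
    logits = alpha[None, :] * (theta[:, None] - beta[None, :])
    p = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(p.shape) < p).astype(int)

def fit_ability(X):
    """Hypothetical stand-in for the real fit: smoothed logit of each subject's mean score."""
    p = (X.sum(axis=1) + 0.5) / (X.shape[1] + 1.0)
    return np.log(p / (1.0 - p))

def parametric_bootstrap(theta_hat, beta_hat, alpha_hat, fit, B=500, seed=0):
    """Treat the fitted parameters as the truth, resimulate, and refit B times."""
    rng = np.random.default_rng(seed)
    reps = np.empty((B, theta_hat.size))
    for b in range(B):
        X_b = simulate_2pl(theta_hat, beta_hat, alpha_hat, rng)  # step 1: simulate
        reps[b] = fit(X_b)                                       # step 2: refit
    return reps

# Made-up "fitted" parameters standing in for the output of a real fit.
rng = np.random.default_rng(2)
theta_hat = np.linspace(-1.5, 1.5, 6)
beta_hat = rng.normal(size=200)
alpha_hat = np.ones(200)

reps = parametric_bootstrap(theta_hat, beta_hat, alpha_hat, fit_ability, B=500)
lo, hi = np.quantile(reps, [0.025, 0.975], axis=0)   # per-subject percentile CI
print(np.column_stack([lo, hi]))
```

The only structural difference from the non-parametric version is where each replicate's data come from: the simulator replaces column resampling, so the interval is only as trustworthy as the assumed likelihood.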
6.6. Seed ensembling
The first four recipes target data-side uncertainty (bootstraps) or posterior uncertainty given a single fit (variational \(\sigma\), MCMC). Seed variance targets a different axis: the randomness introduced by the training procedure itself (initialisation, SGD/Adam trajectory, data shuffling). Running \(N_\text{seed}\) fits and reading the empirical spread of \(\hat\theta^{(r)}\) captures this directly, where \(r\) indexes seeds and \(s\) continues to index subjects.
Algorithm. For \(r = 1, \dots, N_\text{seed}\):
1. Refit the model on the full \(X\) with seed \(r\).
2. Centre that seed’s \(\hat\theta\) and \(\hat\beta\) by the same per-seed constant, for example the seed’s mean ability (with \(\alpha\) fixed, the continuous shift is the only symmetry).
3. After all seeds are linked, report per subject the mean and spread of the linked abilities across seeds.
Treat seed variance as a complementary diagnostic alongside one of the first four recipes. If the seed spread exceeds the posterior CI width, the optimiser is the dominant noise source and the posterior-only intervals understate the real uncertainty. If the seed spread is small, single-seed reporting is defensible. Seeds are not a draw from any distribution; report the result as “spread across seeds”, not as a “95% CI” [3, 4].
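The centring and aggregation steps can be sketched as follows. This is an illustration only: it assumes the per-seed constant is that seed's mean ability (one possible choice, not stated in the text), `link_by_centering` is a hypothetical helper, and the per-seed estimates are synthetic (a shared profile plus an arbitrary shift and small noise per seed).

```python
import numpy as np

def link_by_centering(theta, beta):
    """Subtract each seed's mean ability from that seed's theta and beta.

    theta: (n_seeds, n_subjects), beta: (n_seeds, n_items).
    Using the same constant for both keeps theta - beta unchanged.
    """
    shift = theta.mean(axis=1, keepdims=True)
    return theta - shift, beta - shift

# Synthetic per-seed estimates (assumption: stand-in for real refits).
rng = np.random.default_rng(0)
theta_ref = np.array([-1.0, -0.2, 0.3, 0.9])
beta_ref = np.array([-0.5, 0.0, 0.5])
shifts = rng.normal(scale=2.0, size=(8, 1))          # arbitrary location per seed
theta_seeds = theta_ref + shifts + rng.normal(scale=0.05, size=(8, 4))
beta_seeds = beta_ref + shifts + rng.normal(scale=0.05, size=(8, 3))

theta_linked, beta_linked = link_by_centering(theta_seeds, beta_seeds)

# Per subject: mean and spread across seeds, reported as a spread, not a CI.
theta_mean = theta_linked.mean(axis=0)
theta_spread = theta_linked.std(axis=0, ddof=1)
print(np.column_stack([theta_mean, theta_spread]))
```

Without the linking step, the spread would be dominated by the arbitrary per-seed shift rather than by training noise, which is why aggregation happens only after centring.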
6.7. References
[1] W.-L. Chiang et al. Chatbot Arena: an open platform for evaluating LLMs by human preference. In ICML, 2024. arXiv:2403.04132.
[2] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: a review for statisticians. JASA, 2017. arXiv:1601.00670.
[3] X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, S. Ebrahimi Kahou, V. Michalski, T. Arbel, C. Pal, G. Varoquaux, and P. Vincent. Accounting for variance in machine learning benchmarks. In MLSys, 2021. arXiv:2103.03098.
[4] D. Picard. Torch.manual_seed(3407) is all you need: on the influence of random seeds in deep learning architectures for computer vision. arXiv preprint arXiv:2109.08203, 2021.