IRT: Posterior Estimation

5. IRT: Posterior Estimation#

Virtually all IRT flavors have no closed-form solution, so numerical methods are needed to approximate the posterior. The two prominent options are stochastic variational inference (SVI) and Markov chain Monte Carlo (MCMC).

5.1. Overview#

The IRT posterior \(p(\theta, \beta, \alpha \mid X)\) has no closed form for any of the IRT models. Two algorithm families approximate it, sitting at opposite ends of a speed-fidelity tradeoff: stochastic variational inference (SVI) trades a controlled bias for fast fitting; Markov chain Monte Carlo (MCMC) is unbiased in the sample limit, but sampling can take up to hours.

Algorithm	Family	Output	Wall clock	Bias	Use when
Mean-field SVI	variational	per-latent mean and SE	seconds	projects onto the chosen family; loses correlations; underestimates spread	default; large \(S, N\); first-pass estimate
MCMC	sampling	\(T\) joint draws	minutes to hours	none in the sample limit; cost scales with \(S \cdot N\)	\(S\) in the hundreds; calibrated intervals or rank distributions needed

5.2. Stochastic Variational Inference (SVI)#

SVI turns posterior inference into optimisation. Pick a tractable family of distributions \(q_\phi\), called the guide, and tune its parameters \(\phi\) until \(q_\phi\) is as close to the true posterior \(p\) as the family allows. The popular choice of guide used here is the mean-field guide: factorise the guide into an independent product over the latents,

\[q_\phi(\theta, \beta, \alpha, \dots) \;=\; \prod_s q_{\phi_{\theta,s}}(\theta_s) \;\prod_n q_{\phi_{\beta,n}}(\beta_n) \;\prod_n q_{\phi_{\alpha,n}}(\alpha_n) \;\cdots,\]

with each factor matched to its latent’s support: Normal on \(\mathbb{R}\), logit-Normal on \((0, 1)\), LogNormal on \((0, \infty)\). Each factor is parameterised by \((\mu_k, \sigma_k)\). Point estimates fall out as the fitted means, \(\hat\theta_s = \mu_{\theta,s}\) and so on; the companion \(\sigma_k\) is the posterior standard deviation of the latent \(k\) under the guide.

The closeness measure is the Evidence Lower Bound (ELBO), the sum of two terms over every observation and every latent:

\[\text{ELBO}(\phi) = \underbrace{\sum_{s, n} \mathbb{E}_{q_\phi}[\log p(X_{s,n} \mid \theta_s, \beta_n, \alpha_n)]}_{\text{data fit}} \;-\; \underbrace{\sum_k \mathrm{KL}\bigl(q_{\phi_k} \,\|\, p_k\bigr)}_{\text{stay near prior}}.\]

Two opposing pulls. The first term rewards explaining the data; the second penalises the Kullback-Leibler (KL) divergence from the prior. Both are sums, so more data means more likelihood terms, which dominate the prior as \(S \cdot N\) grows.

With a guide and an objective, the rest is gradient ascent. The expectation in the data-fit term is itself intractable, so every step Monte-Carlo-estimates it from num_particles draws of \(q_\phi\), then backpropagates the average.

# Mean-field SVI: tune each factor's knobs mu_k and sigma_k (together: phi)
# so the guide q_phi matches the posterior as closely as mean-field allows.

# Initialise the knobs. One (mu_k, sigma_k) pair per latent (every theta_s,
# every beta_n, every alpha_n); mu_k starts at 0, sigma_k at 1.
init phi = (mu_k, sigma_k) for every latent k

for epoch in range(T):
    # ELBO: how well the current guide matches the posterior. Higher is
    # better.
    #
    # Computed by Monte Carlo:
    #   1. draw many latents from q_phi via the reparameterisation trick.
    #   2. score them under the model and under q_phi,
    #   3. combine into a single number and do gradient ascent
    loss = -elbo(model, guide, X)

    loss.backward()         # gradient w.r.t. every mu_k, sigma_k
    adam.step()

# Point estimates are just the fitted means.
theta_hat, beta_hat, alpha_hat = mu_theta, mu_beta, mu_alpha

Mean-field SVI is fast because it constrains \(q_\phi\), and that same constraint introduces three predictable biases [1]:

Failure	Practical implication
Mode-seeking	If the posterior is multi-modal, \(q_\phi\) lands on one mode and ignores the others; different seeds can pick different modes. Rerun with several seeds and check Kendall’s \(W\).
Variance underestimation	\(\sigma_k\) can under-report the true posterior spread when posterior correlations are non-trivial [1]. Cross-check with MCMC or a bootstrap.
No posterior correlations	\(\mathrm{Cov}_q(\theta_s, \beta_n) = 0\) by construction. Any downstream quantity needing the joint (e.g. \(P(\theta_s > \theta_{s'})\)) is miscalibrated.

5.3. Markov Chain Monte Carlo (MCMC)#

MCMC replaces SVI’s optimisation with simulation. Instead of fitting a parametric stand-in, we build a Markov chain whose long-run distribution is the posterior itself, run it forward, and keep the visited states as draws. There is no guide family and no projection bias; the price is wall clock.

A draw is one complete assignment of every latent: one \(\theta_s\) for each subject, one \(\beta_n\) (and \(\alpha_n\)) for each item. A chain is an ordered sequence of such draws,

\[\underbrace{(\theta^{(0)}, \beta^{(0)}, \alpha^{(0)})}_{\text{draw }0} \;\to\; \underbrace{(\theta^{(1)}, \beta^{(1)}, \alpha^{(1)})}_{\text{draw }1} \;\to\; \cdots \;\to\; \underbrace{(\theta^{(T)}, \beta^{(T)}, \alpha^{(T)})}_{\text{draw }T},\]

where each step takes the current state and produces the next under a transition rule designed so that the chain settles onto \(p(\theta, \beta, \alpha \mid X)\). After enough steps, the empirical distribution of draws matches the posterior, so sample averages over draws estimate posterior expectations.

# MCMC: step a Markov chain whose long-run distribution is the posterior,
# and keep the visited states as draws.
state = initial_state()

draws = []
for step in range(T):
    state = step_chain(state, model, X)
    draws.append(state)

# Stack per-step states into arrays of shape (T, S), (T, N), ...
theta_draws = stack([d.theta for d in draws])    # shape (T, S)

The kept draws are a (T, S) array of joint posterior samples. Every quantity we use downstream is read off this array as a sample average:

#	Quantity	Computed from `theta_draws`	Reads as
1	Posterior mean	`theta_draws.mean(axis=0)`	\(\hat\theta_s\), the point estimate that orders subjects
2	Credible interval	`quantile(theta_draws, [0.05, 0.95], axis=0)`, or 90% HDI	90% error bar on \(\hat\theta_s\)
3	Rank distribution	per-draw rank of each subject, then quantiles or fractions	rank CI; \(P(\text{subject } s \text{ in top-}k)\)
4	Pairwise comparison	`(theta_draws[:, :, None] > theta_draws[:, None, :]).mean(axis=0)`	\(P(\theta_s > \theta_{s'} \mid X)\)

All four are sample averages, so accuracy improves with more draws. However, before any of those averages can be trusted, two checks tell us whether the chain has actually settled onto the posterior.

The first compares chains: run several from different starting points and ask, for each latent, whether they agree. The Gelman-Rubin statistic \(\hat R\) formalises this by comparing two variances: how much one chain wanders on its own (within-chain variance), and how far the chains’ means sit from one another (between-chain variance). When the chains have mixed onto the same posterior, both numbers track the same thing, and their ratio \(\hat R\) approaches \(1\). Values above \(1.01\) mean the chains have not agreed yet, so at least one of them is still exploring.
The second check corrects for autocorrelation. Consecutive draws within one chain are not independent: each is a small perturbation of the previous, so \(T\) raw draws carry less information than \(T\) truly independent samples would. The effective sample size is the equivalent count of independent draws, \(T_\text{eff} = T / \bigl(1 + 2 \sum_k \rho_k\bigr)\), where \(\rho_k\) is the correlation between draws separated by \(k\) steps. If consecutive draws are nearly identical, those correlations stay near \(1\) for many lags and \(T_\text{eff}\) collapses far below \(T\); we want at least a few hundred per latent.

5.4. References#

[1] (1,2)

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: a review for statisticians. JASA, 2017. arXiv:1601.00670.