2. Meta Analysis: Efficient Evaluation

Running every LLM across tens of thousands of benchmark items is costly. Benchmark compression asks whether a small subset of benchmark questions can stand in for the full benchmark, and published methods report cost savings of 80 to 99 percent at an error of a few percent. This section walks through two archetypal compression methods and the open problems they share.

2.1. Overview

Large benchmarks are expensive at scale. MMLU has around 14,000 test items [1], HELM spans dozens of scenarios, each scored under several metrics [2], and the Open LLM Leaderboard runs every submission across a comparable volume [3], resulting in a substantial CO2 footprint. Efficient evaluation asks whether the same score or rank can be recovered from a small fraction of the items. Current methods report 80 to 99 percent cost reduction at 1 to 3 percent error, though only in the interpolation regime, where the evaluated model resembles the models used for calibration [4, 5, 6, 7].

2.2. Representative Methods

Two archetypes illustrate how the field approaches anchor selection and score prediction. Throughout, let \(i = 1, \ldots, N\) index items in the benchmark pool, \(m = 1, \ldots, M\) index source models used for calibration, and \(j\) denote a new model to be evaluated.

tinyBenchmarks. tinyBenchmarks [4] uses Item Response Theory (IRT) features for both selection and prediction. A multidimensional 2-PL IRT model is fit on a calibration pool of prior model evaluations to obtain the per-item parameters \((\alpha_i, \beta_i)\), where \(\alpha_i\) is the discrimination vector and \(\beta_i\) is the scalar difficulty. K-Means then clusters the items in this feature space, and the item closest to each of around 100 cluster centroids is retained as an anchor. For a new model \(j\), its correctness on the anchors is used to fit a logistic regression, with the item parameters held fixed, for the ability vector \(\theta_j\). Each remaining item \(i\) is then predicted by an IRT forward pass,

\[ P_{ij} = \sigma\!\left(\alpha_i^\top \theta_j - \beta_i\right), \]

where \(\sigma\) is the logistic sigmoid and \(P_{ij}\) is the probability that model \(j\) answers item \(i\) correctly. For the final score, a gp-IRT step blends this IRT-based estimate with the accuracy observed on the anchors.
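The prediction side of this pipeline can be sketched in a few lines of Python. The sketch assumes the item parameters are already calibrated and the anchors already chosen, and it uses the forward pass \(\sigma(\alpha_i^\top \theta_j - \beta_i)\) with a scalar difficulty. The function names (`fit_ability`, `estimate_score`), the plain gradient-ascent solver, the small L2 penalty, and the toy numbers are all assumptions made for illustration, not the tinyBenchmarks implementation. The final score simply combines observed anchor outcomes with predicted probabilities on the remaining items, which omits the gp-IRT blending step.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(alpha, beta, theta):
    """IRT forward pass: sigma(alpha_i . theta - beta_i) for every item i."""
    return [
        sigmoid(sum(a * t for a, t in zip(a_i, theta)) - b_i)
        for a_i, b_i in zip(alpha, beta)
    ]

def fit_ability(alpha, beta, y, steps=500, lr=0.5, l2=0.01):
    """Logistic regression for theta with item parameters held fixed.

    Plain gradient ascent on the Bernoulli log-likelihood of the anchor
    outcomes y (1 = correct, 0 = wrong). The L2 penalty is an assumption
    that keeps theta finite when a model gets every anchor right or wrong.
    """
    theta = [0.0] * len(alpha[0])
    for _ in range(steps):
        p = predict(alpha, beta, theta)
        for d in range(len(theta)):
            grad = sum((y_i - p_i) * a_i[d]
                       for y_i, p_i, a_i in zip(y, p, alpha)) / len(y)
            theta[d] += lr * (grad - l2 * theta[d])
    return theta

def estimate_score(alpha, beta, anchor_idx, y_anchor):
    """Observed anchor outcomes plus predicted probabilities elsewhere."""
    theta = fit_ability([alpha[i] for i in anchor_idx],
                        [beta[i] for i in anchor_idx], y_anchor)
    p_all = predict(alpha, beta, theta)
    anchors = set(anchor_idx)
    predicted = sum(p for i, p in enumerate(p_all) if i not in anchors)
    return (sum(y_anchor) + predicted) / len(alpha)

# Toy pool: 6 items with 2-dimensional discrimination (hypothetical values).
alpha = [[1.0, 0.2], [0.8, 0.9], [1.2, 0.1], [0.5, 1.1], [0.9, 0.4], [0.3, 1.0]]
beta = [-1.0, 0.0, 1.0, 0.5, -0.5, 2.0]
anchor_idx = [0, 2, 3]  # stand-ins for the selected anchors
score = estimate_score(alpha, beta, anchor_idx, [1, 1, 0])
print(f"estimated full-benchmark accuracy: {score:.3f}")
```

Only the anchor items are ever run on the new model; every other item contributes a predicted probability, which is where the cost saving comes from.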

DISCO. DISCO [7] drops IRT and selects anchors by model disagreement. Let \(f^m_c(x_i)\) denote the probability that source model \(m\) assigns class \(c\) to item \(x_i\), with \(c = 1, \ldots, C\). The Predictive Diversity Score (PDS) takes, for each class, the highest probability that any source model assigns to it, and averages these maxima over classes,

\[ \mathrm{PDS}(x_i) = \frac{1}{C} \sum_{c=1}^{C} \max_{m} f^m_c(x_i). \]

PDS equals \(1/C\) when all source models output the same distribution and grows as different models place high confidence on different classes, so high-PDS items are those on which source models confidently disagree. The top-\(K\) items by PDS form the anchor set. For a new model \(j\), its probability vectors \(f^j(x_i)\) over the anchors are concatenated (around 3,100 numbers for typical \(K\)), reduced to 256 dimensions by PCA, and passed through a Random Forest trained to predict the full-benchmark score.
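The selection step can be sketched directly from the PDS formula above. The helper names `pds` and `select_anchors` and the three-item toy pool are hypothetical, and the PCA and Random Forest stages are left out; this is a minimal illustration rather than the DISCO code.

```python
def pds(item_probs):
    """PDS for one item; item_probs[m][c] is model m's probability of class c."""
    n_classes = len(item_probs[0])
    return sum(max(p[c] for p in item_probs) for c in range(n_classes)) / n_classes

def select_anchors(pool, k):
    """Indices of the k items with the highest PDS (ties keep pool order)."""
    scores = [pds(item) for item in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy pool: 3 items, 2 source models, 2 classes (hypothetical probabilities).
pool = [
    [[0.9, 0.1], [0.9, 0.1]],  # confident agreement:    PDS = 0.5
    [[0.9, 0.1], [0.1, 0.9]],  # confident disagreement: PDS = 0.9
    [[0.5, 0.5], [0.6, 0.4]],  # mild disagreement:      PDS = 0.55
]
print(select_anchors(pool, 2))  # item 1 first, then item 2
```

Items where the source models agree contribute little beyond the \(1/C\) baseline, so they are the first to be dropped.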

2.3. Open Problems and Conclusions

The headline savings are real and the underlying methods are sound, but two caveats temper how far they generalize: a fundamental extrapolation gap, and a structural risk that small subsets amplify bias rather than average it out.

Extrapolation Frontier. Current methods degrade substantially when the evaluated model is stronger than the models in the calibration pool [8]. A simple linear model often matches the more elaborate estimators, and subset quality decays as the frontier moves. This is the central open question in the subfield.
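One way to probe this gap is to calibrate only on weaker models and measure the prediction error on the stronger ones that were held out. The sketch below shows such a split; the helper names, the 50 percent cut, and the model scores are hypothetical, and this is not necessarily the protocol used in [8].

```python
def split_by_strength(full_scores, calib_fraction=0.5):
    """Weaker models go to calibration, stronger ones are held out."""
    ranked = sorted(full_scores, key=full_scores.get)
    cut = int(len(ranked) * calib_fraction)
    return ranked[:cut], ranked[cut:]

def mean_abs_error(predicted, full_scores, models):
    """Average |predicted - true| full-benchmark score over the given models."""
    return sum(abs(predicted[m] - full_scores[m]) for m in models) / len(models)

# Hypothetical full-benchmark accuracies for four models.
full_scores = {"model_a": 0.42, "model_b": 0.81, "model_c": 0.55, "model_d": 0.67}
calibration, held_out = split_by_strength(full_scores)
print(calibration, held_out)
```

Comparing `mean_abs_error` on the held-out strong models against the error under a random split gives a direct reading of how much a method relies on interpolation.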

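Contamination Blind Spot. If the selected subset over-represents items that leaked into training data, the compressed benchmark amplifies the resulting bias rather than averaging it out, because each retained item carries far more weight than it would in the full pool.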
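A back-of-the-envelope calculation illustrates the amplification. Assume, purely for illustration, a plain-mean estimator, a fraction \(\rho\) of leaked items in the full pool, a fraction \(\rho_S\) among the selected anchors, and an average accuracy inflation \(\delta\) on each leaked item. These symbols and the numbers below are hypothetical and not taken from the cited work. Under these assumptions,

\[ \mathrm{bias}_{\mathrm{full}} = \rho\,\delta, \qquad \mathrm{bias}_{\mathrm{subset}} = \rho_S\,\delta, \qquad \frac{\mathrm{bias}_{\mathrm{subset}}}{\mathrm{bias}_{\mathrm{full}}} = \frac{\rho_S}{\rho}. \]

With \(\rho = 0.01\), \(\rho_S = 0.05\) (5 leaked items among 100 anchors), and \(\delta = 0.3\), the full-benchmark score is inflated by 0.003 while the subset score is inflated by 0.015, five times as much.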

Overall, efficient evaluation targets benchmarks of \(10^3\) to \(10^5\) items and aims to compress them down to a few percent of the original size. Driving leaderboards in our scope sit well below that range. At this scale, the ultra-compression regime does not apply, especially since proper extrapolation remains a fundamentally unsolved problem. However, modest savings are still plausible.

2.4. References

[1]

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In ICLR. 2021.

[2]

P. Liang and others. Holistic evaluation of language models (HELM). arXiv preprint arXiv:2211.09110, 2022.

[3]

E. Beeching, C. Fourrier, N. Habib, S. Han, N. Lambert, N. Rajani, O. Sanseviero, L. Tunstall, and T. Wolf. Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.

[4]

F. M. Polo and others. tinyBenchmarks: evaluating LLMs with fewer examples. In ICML. 2024.

[5]

A. Kipnis, K. Voudouris, L. M. Schulze Buschoff, and E. Schulz. metabench: a sparse benchmark to measure general ability in large language models. In ICLR. 2025. arXiv:2407.12844.

[6]

C. Xu, G. Saranathan, M. P. Alam, A. Shah, J. Lim, S. Y. Wong, M. Foltin, and S. Bhattacharya. Data efficient evaluation of large language models and text-to-image models via adaptive sampling (SubLIME). arXiv preprint arXiv:2406.15527, 2024.

[7]

A. Rubinstein, B. Raible, M. Gubri, and S. J. Oh. DISCO: diversifying sample condensation for efficient model evaluation. In ICLR. 2026. arXiv:2510.07959.

[8]

G. Zhang, F. E. Dorner, and M. Hardt. How benchmark prediction from fewer data misses the mark. In NeurIPS. 2025. arXiv:2506.07673.