
Reliable and Efficient Amortized Model-Based Evaluation

Sang Truong
2025-06-04 22 min read

TLDR: We enhance the reliability and efficiency of language model evaluation by introducing IRT-based adaptive testing, which has been integrated into the HELM framework.

Motivation

Evaluating large language models (LLMs) can be prohibitively expensive. Modern benchmarks typically involve hundreds of thousands of questions, and running them is resource-intensive: each language model might take hours, days, or even weeks to produce answers, requiring substantial high-performance compute. In addition, grading these answers often requires a judge, which can cost hundreds of human annotator hours or thousands of dollars when using capable but expensive LM judges.

Can we reliably estimate LLMs’ performance more cheaply? The answer is yes. By leveraging adaptive testing, which tailors the evaluation process to each LLM’s performance profile, we can significantly reduce the number of test instances required while preserving the reliability of the evaluation.

We achieve this through the Rasch model from Item Response Theory (IRT). The Rasch model captures how LLM ability and question difficulty jointly determine the probability of a correct answer for each (LLM, question) pair. The first phase is calibration, in which existing response data are analyzed to estimate an ability parameter for each LLM and a difficulty parameter for each question. Under the Rasch model, the difference between ability and difficulty directly determines the likelihood that the LLM answers the question correctly.
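Concretely, the Rasch model passes the ability-difficulty gap through a logistic function. The snippet below is a minimal sketch of that relationship; the function name and example values are ours for illustration, not part of the HELM implementation.

```python
import numpy as np

def rasch_correct_probability(ability: float, difficulty: float) -> float:
    """Rasch model: P(correct) is a logistic function of (ability - difficulty)."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# An LLM whose ability exceeds a question's difficulty answers it correctly
# more often than not; equal ability and difficulty gives a 50% chance.
print(rasch_correct_probability(ability=1.0, difficulty=-0.5))  # ~0.82
print(rasch_correct_probability(ability=0.0, difficulty=0.0))   # 0.50
```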

The difficulty parameters derived from the calibration phase play a crucial role in reducing the cost of evaluation through the second phase: adaptive testing. Adaptive testing dynamically selects the most informative questions, tailoring the evaluation process to each LLM’s estimated ability. Rather than exposing each LLM to an exhaustive and static set of questions, adaptive testing optimizes efficiency by continuously choosing the questions that best elicit information about the LLM’s ability, thus also preserving the reliability of the evaluation.
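Under the Rasch model, the Fisher information a question carries about an ability is p(1 - p), which peaks when the question's difficulty sits right at the current ability estimate. Below is a minimal sketch of such a selection rule, assuming access to pre-calibrated difficulties; the function names are ours, not the HELM implementation.

```python
import numpy as np

def item_information(ability: float, difficulties: np.ndarray) -> np.ndarray:
    """Fisher information of each question under the Rasch model: I = p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-(ability - difficulties)))
    return p * (1.0 - p)

def select_next_question(ability_estimate: float,
                         difficulties: np.ndarray,
                         already_asked: set[int]) -> int:
    """Pick the unasked question whose difficulty is most informative,
    i.e. closest to the current ability estimate."""
    info = item_information(ability_estimate, difficulties)
    info[list(already_asked)] = -np.inf   # never re-ask a question
    return int(np.argmax(info))

# Example: with an estimated ability of 0.3, the selected question is the one
# whose calibrated difficulty is nearest to 0.3 among those not yet asked.
difficulties = np.array([-2.0, -0.4, 0.2, 1.1, 2.4])
print(select_next_question(0.3, difficulties, already_asked={2}))  # -> 1
```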

Evaluation based on the Rasch Model

On the same test set, the LLM ability estimated in the Rasch calibration phase is effectively interchangeable with the metrics commonly reported on AI leaderboards, such as average score. Although Rasch ability roughly follows a standard normal distribution while average scores lie between 0 and 1, the two metrics are almost perfectly correlated and yield the same model ranking. For instance, as shown in Figure 1, on the Civil Comments dataset the Pearson correlation between Rasch-estimated ability and average score reaches 0.99, and both metrics produce identical model rankings with no distortion in ordering.

[Image: scatter plot of average score against Rasch-estimated ability]

Figure 1: Relationship between Rasch-estimated ability and average score on the Civil Comments dataset. Each dot is one LLM. The near-perfect Pearson correlation (0.99) highlights the consistency between Rasch-estimated ability and widely used metrics on the same set of questions.
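As a quick illustration of this kind of agreement check (the numbers below are made up, not the paper's data), one can compare the two per-model summaries directly:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model summaries: Rasch-estimated abilities and average scores.
abilities = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])
avg_scores = np.array([0.41, 0.55, 0.61, 0.72, 0.83])

print(pearsonr(abilities, avg_scores)[0])   # linear agreement; the paper reports 0.99 on Civil Comments
print(spearmanr(abilities, avg_scores)[0])  # 1.0 means identical model rankings
```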

Next, we demonstrate that the Rasch model accurately captures LLM abilities and question difficulties during the calibration phase across 22 datasets drawn from 5 HELM leaderboards. This extensive collection covers both capability and safety measurements, comprising 183 LLMs and over 78,000 questions. We measure the Rasch model's quality by its out-of-sample predictive performance, i.e., how well it predicts which LLM will correctly answer which question on a held-out test set. On average, the Rasch model achieves an AUC-ROC of 0.85 on the training set and 0.83 on the test set, indicating that it reliably reflects LLM performance across a wide range of questions.
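To make this evaluation protocol concrete, here is a minimal sketch of how held-out (LLM, question) outcomes could be scored against Rasch predictions; the data and helper below are fabricated for illustration and are not the paper's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rasch_predict(abilities, difficulties, model_idx, question_idx):
    """Predicted P(correct) for each held-out (LLM, question) pair."""
    return 1.0 / (1.0 + np.exp(-(abilities[model_idx] - difficulties[question_idx])))

# Fabricated held-out outcomes: 1 = the LLM answered correctly, 0 = it did not.
y_true = np.array([1, 0, 0, 1, 1, 0])
y_prob = rasch_predict(
    abilities=np.array([0.5, -0.2]),          # calibrated per-LLM abilities
    difficulties=np.array([-1.0, 0.0, 1.2]),  # calibrated per-question difficulties
    model_idx=np.array([0, 0, 0, 1, 1, 1]),
    question_idx=np.array([0, 1, 2, 0, 1, 2]),
)
print(roc_auc_score(y_true, y_prob))  # out-of-sample AUC-ROC on the fabricated pairs
```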

We provide some examples from GSM8K and MMLU to illustrate that our IRT-derived difficulty parameters reflect question difficulty:

[Image: an easy question and a difficult question from GSM8K and MMLU]

Figure 2: An easy question and a difficult question from GSM8K and MMLU.

Adaptive Testing

As noted, beyond establishing a stable measure of ability and difficulty, the Rasch model paves the way for adaptive testing, a strategy that refines the evaluation process by dynamically tailoring question selection to each LLM's estimated performance. To demonstrate this benefit, we run adaptive testing on Llama 3.1 8B using Civil Comments as an example; note that Llama 3.1 8B is a held-out LLM that was not used in the calibration phase. We compare random question selection against adaptive testing guided by the Fisher information criterion. In this experiment, we treat the ability estimate derived from all questions as the ground truth, record the ability estimate at each step of both methods, and compute the mean squared error (MSE) between the estimate and the ground truth over the first 200 selected questions. The result is clear: adaptive testing significantly outperforms random selection, demonstrating a more efficient and precise evaluation process.

[Image: line graph of MSE against number of asked questions]

Figure 3: Adaptive testing on Llama 3.1 8B using Civil Comments yields significantly lower MSE in ability estimation than random question selection, confirming a more efficient and precise evaluation process.
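The comparison can be reproduced in miniature. The toy simulation below is our own construction, not the paper's code: it fabricates a held-out model, uses the Fisher-information rule sketched earlier for the adaptive arm, and scores both arms against the all-questions ability estimate; a standard-normal prior keeps the early estimates finite.

```python
import numpy as np

rng = np.random.default_rng(0)

def rasch_prob(ability, difficulties):
    """P(correct) under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulties)))

def estimate_ability(responses, difficulties, n_iter=25):
    """MAP ability estimate with a standard-normal prior, fit by Newton's method.
    The prior keeps the estimate finite when only a few responses are observed."""
    theta = 0.0
    for _ in range(n_iter):
        p = rasch_prob(theta, difficulties)
        grad = np.sum(responses - p) - theta     # log-posterior gradient
        hess = np.sum(p * (1.0 - p)) + 1.0       # negative log-posterior curvature
        theta += grad / hess
    return theta

# Fabricated setup: calibrated difficulties and a held-out model's full response vector.
difficulties = rng.normal(size=1000)
true_ability = 0.7
responses = (rng.random(1000) < rasch_prob(true_ability, difficulties)).astype(float)
ground_truth = estimate_ability(responses, difficulties)  # "all questions" estimate

def run(adaptive, budget=200):
    """Ask `budget` questions one at a time and track squared error vs. ground truth."""
    asked, errors, theta = [], [], 0.0
    for _ in range(budget):
        remaining = np.setdiff1d(np.arange(len(difficulties)), asked)
        if adaptive:
            p = rasch_prob(theta, difficulties[remaining])
            nxt = remaining[np.argmax(p * (1.0 - p))]  # maximum Fisher information
        else:
            nxt = rng.choice(remaining)                # random-selection baseline
        asked.append(int(nxt))
        theta = estimate_ability(responses[asked], difficulties[asked])
        errors.append((theta - ground_truth) ** 2)
    return errors

print("adaptive MSE:", np.mean(run(adaptive=True)))
print("random   MSE:", np.mean(run(adaptive=False)))
```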

How to use?

We have uploaded the pre-calibrated difficulty parameters to HuggingFace at stair-lab/reeval-difficulty-for-helm, currently covering 22 distinct HELM datasets. We have also implemented adaptive testing in the HELM framework using these pre-calibrated difficulties. The code for question difficulty estimation is available here; for comprehensive details, please consult the documentation.
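For a first look at the released difficulties, something like the snippet below should work; we assume the artifact is a HuggingFace dataset repository, so the `repo_type` and the file layout are assumptions to verify against the repository card.

```python
# Hedged sketch: download the pre-calibrated difficulty files for inspection.
# We assume stair-lab/reeval-difficulty-for-helm is a dataset repo; check the
# repo card for the actual file layout and column names before relying on them.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("stair-lab/reeval-difficulty-for-helm", repo_type="dataset")
print(local_dir)  # inspect the downloaded files, then point HELM's adaptive testing at them
```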

Conclusion

As LLM evaluations become increasingly costly, Rasch-model-based adaptive testing provides a reliable and efficient approach for assessing LLM performance. This work demonstrates that Rasch-based calibration and adaptive testing can be applied effectively at scale, achieving significant efficiency gains by reducing the number of benchmark questions needed. The promising results across extensive benchmark evaluations underscore the method's reliability and potential for broad application. The approach is now integrated into the HELM framework, setting the stage for streamlined and scalable evaluation across diverse testing scenarios. For more information, please refer to our paper.
