Article List
HELM Long Context
We introduce the HELM Long Context leaderboard for transparent, comparable and reproducible evaluatio...
Stanford AI Research
Reliable and Efficient Amortized Model-Based Evaluation
TLDR: We enhance the reliability and efficiency of language model evaluation by introducing IRT-based adaptive testing, which has been integrated into...
Stanford AI Research
Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Pu…
TL;DR...
Stanford AI Research
BountyBench: Dollar Impact of AI Agent Attackers and Defend…
We introduce BountyBench, a benchmark featuring 25 systems with complex, real-world codebases, and 40 bug...
Stanford AI Research
HELM Capabilities: Evaluating LMs Capability by Capability
Introducing HELM Capabilities, a benchmark that evaluates language models across a curated set of key capabilities, providing a comparison of their st...
Stanford AI Research
General-Purpose AI Needs Coordinated Flaw Reporting
Today, we are calling for AI developers to invest in the needs of third-party, independent researchers who investigate flaws in AI systems. Our new p...
HELM Safety: Towards Standardized Safety Evaluations of Lan…
*Work done while at Stanford CRFM...
Stanford AI Research
Advancing Customizable Benchmarking in HELM via Unitxt Inte…
The Holistic Evaluation of Language Models (HELM) framework is an open source framework for reproducible and transparent benchmarking of language mode...
ThaiExam Leaderboard in HELM
Stanford AI Research
Cybench: A Framework for Evaluating Cybersecurity Capabilit…
We introduce Cybench, a benchmark consisting of 40 cybersecurity tasks from professional CTF competitions. ...