Article List

Stanford AI Research Yifan Mai

HELM Long Context

We introduce the HELM Long Context leaderboard for transparent, comparable and reproducible evaluatio...

3 months, 2 weeks ago stanford-cr…
27245 words 90 min
Reliable and Efficient Amortized Model-Based Evaluation Stanford AI Research

TLDR: We enhance the reliability and efficiency of language model evaluation by introducing IRT-based adaptive testing, which has been integrated into...

7 months, 1 week ago stanford-cr…
6855 words 22 min
Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) Stanford AI Research

TL;DR...

7 months, 2 weeks ago stanford-cr…
40452 words 134 min
BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems Stanford AI Research

We introduce BountyBench, a benchmark featuring 25 systems with complex, real-world codebases, and 40 bug...

7 months, 3 weeks ago stanford-cr…
14298 words 47 min
HELM Capabilities: Evaluating LMs Capability by Capability Stanford AI Research

Introducing HELM Capabilities, a benchmark that evaluates language models across a curated set of key capabilities, providing a comparison of their st...

9 months, 3 weeks ago stanford-cr…
55695 words 185 min
General-Purpose AI Needs Coordinated Flaw Reporting Stanford AI Research

Today, we are calling for AI developers to invest in the needs of third-party, independent researchers, who investigate flaws in AI systems. Our new p...

10 months ago stanford-cr…
10325 words 34 min
Stanford AI Research Farzaan Kaiyom

HELM Safety: Towards Standardized Safety Evaluations of Lan…

*Work done while at Stanford CRFM...

1 year, 2 months ago stanford-cr…
24957 words 83 min
Advancing Customizable Benchmarking in HELM via Unitxt Integration Stanford AI Research

The Holistic Evaluation of Language Models (HELM) framework is an open source framework for reproducible and transparent benchmarking of language mode...

1 year, 4 months ago stanford-cr…
6084 words 20 min
Stanford AI Research Yifan Mai

ThaiExam Leaderboard in HELM

ThaiExam Leaderboard in HELM...

1 year, 4 months ago stanford-cr…
9485 words 31 min
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models Stanford AI Research

We introduce Cybench, a benchmark consisting of 40 cybersecurity tasks from professional CTF competitions. ...

1 year, 4 months ago stanford-cr…
13307 words 44 min