Article List - NewsStore

Stanford AI Research Yifan Mai

HELM Long Context

We introduce the HELM Long Context leaderboard for transparent, comparable and reproducible evaluatio...

3 months, 2 weeks ago • stanford-cr…

27245 words 90 min

Stanford AI Research

Reliable and Efficient Amortized Model-Based Evaluation

TLDR: We enhance the reliability and efficiency of language model evaluation by introducing IRT-based adaptive testing, which has been integrated into...

7 months, 1 week ago • stanford-cr…

6855 words 22 min

Stanford AI Research

Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Pu…

7 months, 2 weeks ago • stanford-cr…

40452 words 134 min

Stanford AI Research

BountyBench: Dollar Impact of AI Agent Attackers and Defend…

We introduce BountyBench, a benchmark featuring 25 systems with complex, real-world codebases, and 40 bug...

7 months, 3 weeks ago • stanford-cr…

14298 words 47 min

Stanford AI Research

HELM Capabilities: Evaluating LMs Capability by Capability

Introducing HELM Capabilities, a benchmark that evaluates language models across a curated set of key capabilities, providing a comparison of their st...

9 months, 3 weeks ago • stanford-cr…

55695 words 185 min

Stanford AI Research

General-Purpose AI Needs Coordinated Flaw Reporting

Today, we are calling for AI developers to invest in the needs of third-party, independent researchers, who investigate flaws in AI systems. Our new p...

10 months ago • stanford-cr…

10325 words 34 min

Stanford AI Research Farzaan Kaiyom

HELM Safety: Towards Standardized Safety Evaluations of Lan…

*Work done while at Stanford CRFM...

1 year, 2 months ago • stanford-cr…

24957 words 83 min

Stanford AI Research

Advancing Customizable Benchmarking in HELM via Unitxt Inte…

The Holistic Evaluation of Language Models (HELM) framework is an open source framework for reproducible and transparent benchmarking of language mode...

1 year, 4 months ago • stanford-cr…

6084 words 20 min

Stanford AI Research Yifan Mai

ThaiExam Leaderboard in HELM

1 year, 4 months ago • stanford-cr…

9485 words 31 min

Stanford AI Research

Cybench: A Framework for Evaluating Cybersecurity Capabilit…

We introduce Cybench, a benchmark consisting of 40 cybersecurity tasks from professional CTF competitions. ...

1 year, 4 months ago • stanford-cr…

13307 words 44 min