HELM Long Context
> We introduce the [HELM Long Context leaderboard](https://crfm.stanford.edu/helm/long-context/latest/) for transparent, comparable, and reproducible evaluations of the long context capabilities of recent models.
Introduction
Recent Large Language Models (LLMs) support processing long inputs with hundreds of thousands or millions of tokens. Long context capabilities are important for many real-world applications, such as processing long text documents, conducting long conversations, or following complex instructions. However, support for long inputs does not equate to strong long context capabilities: earlier long context models failed at relatively simple tasks such as needle-in-a-haystack (NIAH) retrieval. As such, there is a need for rigorous evaluations of long context capabilities.
There are a number of available long context benchmarks, such as ZeroScrolls (Shaham et al., 2023), MTOB (Tanzer et al., 2023), ∞Bench (Zhang et al., 2024), RULER (Hsieh et al., 2024), LOFT (Lee et al., 2024), Michelangelo (Vodrahalli et al., 2024) and HELMET (Yen et al., 2025). Using these benchmarks, some model developers have published long context benchmark results for their models. However, there is a lack of rigorous and comprehensive benchmark coverage of recent models. To address this, we introduce the HELM Long Context leaderboard, which evaluates leading long context LLMs on a curated set of tasks from existing long context benchmarks. This leaderboard provides transparent, comparable and reproducible evaluations of long context capabilities of recent models.
This leaderboard was produced through a research collaboration with LVMH, and was funded by the HAI Industrial Affiliate Program.
Benchmarks
To assess long context understanding and reasoning capabilities, we selected benchmarks that are recent, relevant, and consist of input instances with hundreds of thousands of tokens. Many of our selected benchmarks are challenging and have not been saturated by recent LLMs yet, thus allowing room for LLMs to demonstrate future improvements in capabilities. We selected five tasks: RULER SQuAD, RULER HotPotQA, ∞Bench En.MC, ∞Bench En.Sum, and OpenAI-MRCR. These tasks were sourced from the following benchmarks.
RULER
RULER (Hsieh et al., 2024) is a benchmark that expands on the vanilla NIAH benchmark by introducing new task categories that test capabilities beyond searching the context. It includes 13 tasks across 4 task categories (retrieval, multi-hop tracing, aggregation, and question answering).
We selected two tasks from RULER:
- RULER SQuAD — open-ended single-hop question answering using questions and passages from SQuAD (Rajpurkar et al., 2018), using the paper authors’ string match metric as the main accuracy metric
- RULER HotPotQA — open-ended multi-hop question answering using questions and passages from HotPotQA (Yang et al., 2018), using the paper authors’ string match metric as the main accuracy metric
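The string match metric checks whether the reference answer appears in the model's output. A minimal sketch, assuming substring matching (the official RULER implementation may normalize text differently):

```python
def string_match(prediction: str, references: list[str]) -> float:
    # Score 1.0 if any reference answer appears as a (case-insensitive)
    # substring of the model's prediction, else 0.0.
    pred = prediction.lower()
    return float(any(ref.lower() in pred for ref in references))
```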
Both tasks require answering questions based on short passages. The input instances are constructed by combining the passages required to answer the question with distractor passages sampled from the dataset. This setup is similar to retrieval-augmented generation (RAG), where a model is asked to answer a question based on retrieved documents, not all of which may be relevant to the question.
Note that RULER is a configurable benchmark that dynamically constructs input instances given a configuration consisting of a desired sequence length and tokenizer. In the case of these two question answering tasks, the sequence length configuration controls the number of distractor documents that are added to each input instance.
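To illustrate, here is a rough sketch of how such a question answering instance could be assembled; the function and field names are hypothetical, and the official RULER code differs in its details:

```python
import random

def build_qa_instance(question, gold_passages, distractor_pool,
                      max_tokens=128_000):
    """Pack the gold passages plus sampled distractors up to a token budget."""
    passages = list(gold_passages)
    budget = max_tokens - sum(len(p.split()) for p in passages)
    pool = list(distractor_pool)
    random.shuffle(pool)
    for passage in pool:
        cost = len(passage.split())  # whitespace tokenization
        if cost > budget:
            break
        passages.append(passage)
        budget -= cost
    random.shuffle(passages)  # hide the gold passages among the distractors
    return {"question": question, "context": passages}
```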
∞Bench
∞Bench (Zhang et al., 2024) is a benchmark that includes 12 tasks in English and Chinese spanning 5 domains (retrieval, code, math, novels, and dialogue). It consists of 3946 examples with an average length of approximately 200K tokens.
We selected two tasks from ∞Bench:
- ∞Bench En.MC — multiple choice question answering based on the plot of an English novel, using exact match as the main accuracy metric
- ∞Bench En.Sum — summarization of the plot of an English novel, using ROUGE-L as the main accuracy metric
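For context, ROUGE-L scores a generated summary by the longest common subsequence (LCS) it shares with the reference. A simplified word-level F1 sketch (the official scorer adds tokenization and normalization details):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic program for longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```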
For both tasks, the paper’s authors constructed the inputs using novels sourced from websites. To mitigate train-test overlap, the authors employed key entity replacement, substituting prominent entities such as main character names with unrelated ones.
OpenAI-MRCR
OpenAI-MRCR is OpenAI’s open-source version of the Multi-Round Co-reference Resolution (MRCR) task, which was originally introduced in the Gemini 1.5 report (Google et al., 2024) and Michelangelo (Vodrahalli et al., 2024). MRCR can be viewed as a more challenging extension of NIAH: both are retrieval evaluations, but unlike NIAH, MRCR requires retrieving information from multiple locations in the input.
The model is given a long, multi-turn, synthetic conversation between a user and a model in which the user repeatedly asks the model to generate text on a topic, e.g. “write a poem about tapirs”. Several identical requests are hidden in this conversational context. At the end of the conversation, the model is prompted to return the i-th instance of one of these requests with a random prefix prepended, e.g. “Prepend aYooSG8CQg to the 2nd (1 indexed) poem about tapirs”. The main accuracy metric is computed as follows: if the model’s response includes the required prefix, then the response is scored by its similarity to the reference answer using SequenceMatcher from Python’s difflib; otherwise, the response receives a score of 0.
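This scoring rule is simple to express in code. A minimal sketch following the description above (the official OpenAI grader may differ in its exact handling of the prefix):

```python
from difflib import SequenceMatcher

def mrcr_score(response: str, reference: str, prefix: str) -> float:
    # A response that omits the required random prefix scores 0.
    if prefix not in response:
        return 0.0
    # Otherwise, grade by string similarity to the reference answer.
    return SequenceMatcher(None, response, reference).ratio()
```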
Input Processing
For each of the above tasks, we only used input instances with at most 128K tokens, in order to avoid exceeding the maximum input length of our models. For ∞Bench En.Sum, ∞Bench En.MC, and OpenAI-MRCR, we filtered out instances with more than 128K tokens under whitespace tokenization. For RULER HotPotQA and RULER SQuAD, we generated instances by configuring RULER with a sequence length of 128K and a whitespace tokenizer. Finally, we sub-sampled 100 instances from each task for use in our evaluations.
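In code, the filtering and sub-sampling step amounts to something like the following sketch, where the instance format is hypothetical:

```python
import random

MAX_TOKENS = 128_000    # cap on whitespace-tokenized input length
NUM_INSTANCES = 100     # instances sub-sampled per task

def prepare_task(instances: list[dict]) -> list[dict]:
    # Drop instances whose input exceeds the token cap, then sub-sample.
    filtered = [x for x in instances if len(x["input"].split()) <= MAX_TOKENS]
    return random.sample(filtered, NUM_INSTANCES)
```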
Models
We evaluated 11 recent models from 5 organizations, with maximum context lengths ranging from 300K to 10M tokens. We selected these models due to their strong performance on the HELM Capabilities leaderboard, which measures general LLM capabilities. Of these models, only the Meta Llama 4 models are open-weights, while the remaining models are closed-weights.
The table below summarizes the context lengths and pre-existing long context evaluation results that have been provided by the model developers.
| Organization | Model Name | Context Length | Previous Long-Context Evaluations |
|---|---|---|---|
| Amazon | Nova Premier | 1M | None |
| Amazon | Nova Pro | 300K | LLMTest NIAH, ZeroScrolls SQuALITY |
| Amazon | Nova Lite | 300K | LLMTest NIAH, ZeroScrolls SQuALITY |
| Google | Gemini 2.0 Flash | 1M | Michelangelo MRCR |
| Google | Gemini 2.0 Flash Lite | 1M | Michelangelo MRCR |
| Meta | Llama 4 Scout (17Bx16E) Instruct | 10M | MTOB, NIAH (unknown version) |
| Meta | Llama 4 Maverick (17Bx128E) Instruct FP8 | 10M | MTOB, NIAH (unknown version) |
| OpenAI | GPT-4.1 | 1M | OpenAI MRCR, Graphwalks |
| OpenAI | GPT-4.1 mini | 1M | OpenAI MRCR, Graphwalks |
| OpenAI | GPT-4.1 nano | 1M | OpenAI MRCR, Graphwalks |
| Writer | Palmyra X5 | 1M | OpenAI MRCR |
Although some pre-existing long context evaluation results were available, we found that they were neither comprehensive nor comparable. There are multiple versions of NIAH and MRCR, and different versions were used for different models. Additionally, the Gemini 2.0 benchmark results used an internal version of MRCR that is not accessible to external researchers, and the Llama 4 benchmark results did not specify which version of NIAH was used.
Results
The benchmark results are as follows:
| Model | Mean score | RULER SQuAD - RULER String Match | RULER HotPotQA - RULER String Match | ∞Bench En.MC - EM | ∞Bench En.Sum - ROUGE-L | OpenAI MRCR - MRCR Accuracy |
|---|---|---|---|---|---|---|
| GPT-4.1 (2025-04-14) | 0.588 | 0.88 | 0.70 | 0.97 | 0.174 | 0.214 |
| GPT-4.1 mini (2025-04-14) | 0.530 | 0.82 | 0.64 | 0.82 | 0.160 | 0.208 |
| Gemini 2.0 Flash | 0.527 | 0.85 | 0.55 | 0.87 | 0.151 | 0.216 |
| Palmyra X5 | 0.525 | 0.78 | 0.57 | 0.87 | 0.146 | 0.256 |
| Llama 4 Maverick (17Bx128E) Instruct FP8 | 0.519 | 0.78 | 0.55 | 0.89 | 0.161 | 0.215 |
| Amazon Nova Premier | 0.500 | 0.71 | 0.57 | 0.92 | 0.137 | 0.161 |
| Gemini 2.0 Flash Lite | 0.497 | 0.77 | 0.53 | 0.85 | 0.155 | 0.179 |
| Llama 4 Scout (17Bx16E) Instruct | 0.469 | 0.68 | 0.47 | 0.85 | 0.176 | 0.171 |
| Amazon Nova Lite | 0.404 | 0.59 | 0.35 | 0.82 | 0.148 | 0.111 |
| GPT-4.1 nano (2025-04-14) | 0.395 | 0.56 | 0.41 | 0.72 | 0.113 | 0.170 |
| Amazon Nova Pro | 0.392 | 0.59 | 0.37 | 0.79 | 0.110 | 0.099 |
GPT-4.1 obtained the highest mean score of 0.588, as well as the highest scores on RULER SQuAD, RULER HotPotQA, and ∞Bench En.MC and the second highest score on ∞Bench En.Sum. Llama 4 Scout (17Bx16E) Instruct obtained the highest score on ∞Bench En.Sum, and Palmyra X5 obtained the highest score on OpenAI-MRCR.
In general, the models' rankings on the Long Context leaderboard are similar to their rankings on the HELM Capabilities leaderboard, with a Spearman rank correlation of 0.90 (p=0.00016). Notably, GPT-4.1 is the top-ranked model on both leaderboards.
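For reference, such a rank correlation can be computed with scipy; the rankings below are placeholders, not the actual leaderboard orderings:

```python
from scipy.stats import spearmanr

# Hypothetical example: each model's rank on the two leaderboards,
# listed in the same model order. Replace with the actual rankings.
long_context_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
capabilities_ranks = [1, 3, 2, 5, 4, 6, 8, 7, 10, 9, 11]

rho, p_value = spearmanr(long_context_ranks, capabilities_ranks)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.5f})")
```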
We observed that within most model families, performance generally increased with model size. The one exception was the Amazon Nova family, where Amazon Nova Lite outperformed Amazon Nova Pro, achieving a higher mean score and higher scores on three of the five tasks.
The results demonstrate that even though significant progress has been made on long context capabilities, there is still considerable room for improvement. For instance, the highest score on OpenAI-MRCR was only 0.256, even though the task itself is conceptually simple.
Limitations
Long context benchmarking is an active area of research, and many recent benchmarks have been proposed that address various shortcomings of our chosen benchmarks. We did not include these benchmarks in our evaluations, but we acknowledge their usefulness for future evaluations:
- HELMET (Yen et al., 2025) includes additional task categories such as passage-reranking and generation with citations, uses model-based metrics for improved metric reliability, and uses in-context learning to allow evaluation of base models.
- NoCha (Karpinska et al., 2024) is a true-or-false question answering benchmark over books; it uses only books published in 2023 or 2024 in order to mitigate the risk of train-test overlap.
- OpenAI Graphwalks is a graph traversal task in which the model is presented with a list of edges and asked to perform a graph traversal operation; it is challenging because, unlike other multi-hop tasks, it cannot be solved with a single sequential pass over the input.
Our leaderboard also lacks coverage of certain kinds of realistic long context tasks that arise in an industrial context. Long context models may be used to analyze long documents including images, charts and figures. Long context models may also be deployed as conversational assistants, in which they may be required to hold long conversations with users while performing tasks using tools. Our leaderboard currently does not include any tasks involving multimodal inputs, realistic user conversations, or tool usage.
Conclusion
The HELM Long Context leaderboard provides comparable benchmark results for the long context capabilities of recent models. As with all HELM leaderboards, it provides full transparency into all LLM requests and responses, and the results are reproducible using the HELM open source framework. We hope that this provides a valuable resource to the community.
Appendix
Sample prompts
The following are example instances from each of our tasks. The ellipses (...) indicate where text was removed from the original inputs for brevity.
RULER HotPotQA
Input:
Answer the question based on the given documents. Only give me the answer and do not output any other words.
The following are given documents.
Document 1:
Kiss Land
Kiss Land is the debut studio album by Canadian singer The Weeknd. It was released on September 10, 2013, by XO and Republic Records. The album was supported by six singles: "Kiss Land", "Belong to the World", "Love in the Sky", "Live For", "Pretty" and "Wanderlust". The album's sole guest appearance comes from frequent collaborator Drake. The album's production was primarily handled by DannyBoyStyles, The Weeknd himself and Jason "DaHeala" Quenneville, among others.
...
Document 170:
Mick Carter
Michael "Mick" Carter is a fictional character from the BBC soap opera "EastEnders" played by Danny Dyer, making his first appearance in the show in the episode originally broadcast in the United Kingdom on 25 December 2013. Dyer's casting was announced on 1 October 2013 with that of his on-screen partner, Linda Carter, played by Kellie Bright. He is the son of established character Shirley Carter (Linda Henry) and a member of the Carter family. Mick immediately becomes the new landlord of The Queen Victoria public house (referred to as "The Vic"), bringing Linda, their son Johnny Carter (Sam Strike/Ted Reilly) and pet dog Lady Di with him. They are later joined by daughter Nancy Carter (Maddy Hill), son Lee Carter (Danny-Boy Hatchard) and other members of the extended family.
...
Document 252:
The Queen Victoria
The Queen Victoria (more often referred to as The Queen Vic or The Vic) is the Victorian public house in the BBC soap opera, "EastEnders". It has the fictional address of 46 Albert Square, Walford, London E20.
...
Document 1390:
MillerCoors
MillerCoors is a beer brewing company in the United States. In 2002 South African Breweries purchased Miller Brewing Company to create SABMiller. In 2005, Molson Brewery of Canada and Coors Brewing Company merged to form the Molson Coors Brewing Company. Then, in 2008, SABMiller and Molson Coors created MillerCoors as a joint venture for their operations in the U.S. The company is the second-largest brewer in the U.S., after Anheuser-Busch.
Answer the question based on the given documents. Only give me the answer and do not output any other words.
Question: Mick Carter is the landlord of a public house located at what address? Answer:
Reference answer:
46 Albert Square, Walford, London E20
RULER SQuAD
Input:
Answer the question based on the given documents. Only give me the answer and do not output any other words.
The following are given documents.
Document 1:
Southern California is also home to a large home grown surf and skateboard culture. Companies such as Volcom, Quiksilver, No Fear, RVCA, and Body Glove are all headquartered here. Professional skateboarder Tony Hawk, professional surfers Rob Machado, Tim Curran, Bobby Martinez, Pat O'Connell, Dane Reynolds, and Chris Ward, and professional snowboarder Shaun White live in southern California. Some of the world's legendary surf spots are in southern California as well, including Trestles, Rincon, The Wedge, Huntington Beach, and Malibu, and it is second only to the island of Oahu in terms of famous surf breaks. Some of the world's biggest extreme sports events, including the X Games, Boost Mobile Pro, and the U.S. Open of Surfing are all in southern California. Southern California is also important to the world of yachting. The annual Transpacific Yacht Race, or Transpac, from Los Angeles to Hawaii, is one of yachting's premier events. The San Diego Yacht Club held the America's Cup, the most prestigious prize in yachting, from 1988 to 1995 and hosted three America's Cup races during that time.
...
Document 801:
Normandy was the site of several important developments in the history of classical music in the 11th century. Fécamp Abbey and Saint-Evroul Abbey were centres of musical production and education. At Fécamp, under two Italian abbots, William of Volpiano and John of Ravenna, the system of denoting notes by letters was developed and taught. It is still the most common form of pitch representation in English- and German-speaking countries today. Also at Fécamp, the staff, around which neumes were oriented, was first developed and taught in the 11th century. Under the German abbot Isembard, La Trinité-du-Mont became a centre of musical composition.
...
Document 1020:
Where CHP is not used, steam turbines in power stations use surface condensers as a cold sink. The condensers are cooled by water flow from oceans, rivers, lakes, and often by cooling towers which evaporate water to provide cooling energy removal. The resulting condensed hot water output from the condenser is then put back into the boiler via a pump. A dry type cooling tower is similar to an automobile radiator and is used in locations where water is costly. Evaporative (wet) cooling towers use the rejected heat to evaporate water; this water is kept separate from the condensate, which circulates in a closed system and returns to the boiler. Such towers often have visible plumes due to the evaporated water condensing into droplets carried up by the warm air. Evaporative cooling towers need less water flow than "once-through" cooling by river or lake water; a 700 megawatt coal-fired power plant may use about 3600 cubic metres of make-up water every hour for evaporative cooling, but would need about twenty times as much if cooled by river water.[citation needed]
Answer the question based on the given documents. Only give me the answer and do not output any other words.
Question: In what century did important classical music developments occur in Normandy? Answer:
Reference answer:
11th
∞Bench En.MC
Input:
Read the passage and answer the following question. Respond with only a single letter corresponding to your choice. Do not include a period in your answer.
Chapter - "The InGen Incident"
The late twentieth century has witnessed a scientific gold rush of astonishing proportions: the headlong and furious haste to commercialize genetic engineering. This enterprise has proceeded so rapidly-with so little outside commentary-that its dimensions and implications are hardly understood at all.
Biotechnology promises the greatest revolution in human history. By the end of this decade, it will have outdistanced atomic power and computers in its effect on our everyday lives. In the words of one observer, "Biotechnology is going to transform every aspect of human life: our medical care, our food, our health, our entertainment, our very bodies. Nothing will ever be the same again. It's literally going to change the face of the planet."
...
Lucky said, "You're telling me we're not going anywhere?"
By the end of the novel, which dinosaur species is presumed to have attacked Ernie?
A. Procompsognathid
B. Tyrannosaurus
C. Velociraptor
D. Hypsilophodontids
Answer the question above based on the passage. Respond with only a single letter corresponding to your choice. Do not include a period in your answer.
Reference answer:
C
∞Bench En.Sum
Input:
‘Yes, of course, if it’s fine to-morrow,’ said Mrs Bronwyn. ‘But you’ll have to be up with the lark,’ she added.
To her son these words conveyed an extraordinary joy, as if it were settled the expedition were bound to take place, and the wonder to which he had looked forward, for years and years it seemed, was, after a night’s darkness and a day’s sail, within touch. Since he belonged, even at the age of six, to that great clan which cannot keep this feeling separate from that, but must let future prospects, with their joys and sorrows, cloud what is actually at hand, since to such people even in earliest childhood any turn in the wheel of sensation has the power to crystallize and transfix the moment upon which its gloom or radiance rests, Seb Bronwyn, sitting on the floor cutting out pictures from the illustrated catalogue of the Army and Navy Stores, endowed the picture of a refrigerator as his mother spoke with heavenly bliss. It was fringed with joy. The wheelbarrow, the lawnmower, the sound of poplar trees, leaves whitening before rain, rooks cawing, brooms knocking, dresses rustling—all these were so coloured and distinguished in his mind that he had already his private code, his secret language, though he appeared the image of stark and uncompromising severity, with his high forehead and his fierce blue eyes, impeccably candid and pure, frowning slightly at the sight of human frailty, so that his mother, watching him guide his scissors neatly round the refrigerator, imagined him all red and ermine on the Bench or directing a stern and momentous enterprise in some crisis of public affairs.
...
Quickly, as if she were recalled by something over there, she turned to her canvas. There it was—her picture. Yes, with all its green and blues, its lines running up and across, its attempt at something. It would be hung in the attics, she thought; it would be destroyed. But what did that matter? she asked herself, taking up her brush again. She looked at the steps; they were empty; she looked at her canvas; it was blurred. With a sudden intensity, as if she saw it clear for a second, she drew a line there, in the centre. It was done; it was finished. Yes, she thought, laying down her brush in extreme fatigue, I have had my vision.
The summary of the book is:
Reference answer:
The novel is fragmented into stream-of-consciousness contributions from various narrators.
“The Window” opens just before the start of World War I. Mr. Bronwyn and Mrs. Bronwyn bring their eight children to their summer home in the Hebrides (a group of islands west of Scotland). Across the bay from their house stands a large lighthouse. Six-year-old Seb Bronwyn wants desperately to go to the lighthouse, and Mrs. Bronwyn tells him that they will go the next day if the weather permits. Seb reacts gleefully, but Mr. Bronwyn tells him coldly that the weather looks to be foul. Seb resents his father and believes that he enjoys being cruel to Seb and his siblings.
...
In “The Kindergarten” section, time returns to the slow detail of shifting points of view, similar in style to “The Window.” Mr. Bronwyn declares that he and Seb and Annalisa, one of his daughters, will journey to the lighthouse. On the morning of the voyage, delays throw him into a fit of temper. He appeals to Pete for sympathy, but, unlike Mrs. Bronwyn, she is unable to provide him with what he needs. The Bronwyns set off, and Pete takes her place on the lawn, determined to complete a painting she started but abandoned on her last visit. Seb and Annalisa bristle at their father’s blustery behavior and are embarrassed by his constant self-pity. Still, as the boat reaches its destination, the children feel a fondness for him. Even Seb, whose skill as a sailor Mr. Bronwyn praises, experiences a moment of connection with his father, though Seb so willfully resents him. Across the bay, Pete puts the finishing touch on her painting. She makes a definitive stroke on the canvas and puts her brush down, finally having achieved her vision.
OpenAI-MRCR
Input:
user:
Here are some examples of conversations succeeded by a follow-up question answered correctly:
...
user:
write a short scene in a play about readings
assistant:
**Title: "Clairvoyant Connections"**
...
user:
write a riddle about cars
assistant:
I'm not alive, but I eat and grow,
...
user:
write a riddle about cars
assistant:
I have four wheels but I'm not a toy,
A growling engine is my pride and joy.
I carry you near, I carry you far,
I’m not a plane, but I'm faster than most by far.
Fill me up with liquid gold,
In towns and cities, I am sold.
What am I?
...
user:
write a formal letter about erasers
assistant:
[Your Name]
...
user:
Prepend dejbsyVpvK to the 6th (1 indexed) riddle about cars. Do not include any other text in your response.
Reference answer:
dejbsyVpvKI have four wheels but I'm not a toy,
A growling engine is my pride and joy.
I carry you near, I carry you far,
I’m not a plane, but I'm faster than most by far.
Fill me up with liquid gold,
In towns and cities, I am sold.
What am I?