How to Benchmark LLMs – ARC AGI 3

In the last few weeks, we have seen the release of powerful LLMs such as Qwen 3 MoE, Kimi K2, and Grok 4. We will likely keep seeing such rapid improvements for the foreseeable future, and to compare the LLMs against each other, we need benchmarks. In this article, I discuss the newly released ARC AGI 3 benchmark and why frontier LLMs struggle to complete any tasks on it.

Motivation

Today, we’re announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI

We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API

Starting scores – Frontier AI: 0%, Humans: 100%

— ARC Prize (@arcprize) July 18, 2025

ARC AGI 3 was recently released.

My motivation for writing this article is to stay on top of the latest developments in LLM technology. Only in the last couple of weeks have we seen the Kimi K2 model (best open-source model when released), Qwen 3 235B-A22B (currently best open-source model), Grok 4, and so on. There is so much happening in the LLM space, and one way to keep up is to track the benchmarks.

I think the ARC AGI benchmark is particularly interesting, mainly because I want to see if LLMs can match human-level intelligence. ARC AGI puzzles are made so that humans are able to complete them, but LLMs will struggle.


Introduction to ARC AGI

ARC AGI is essentially a puzzle game of pattern matching.

ARC AGI 1: You are given a series of input-output pairs and have to complete the pattern
ARC AGI 2: Similar to the first benchmark, you perform pattern matching on input and output examples
ARC AGI 3: Here you play a game in which you have to move your block into the goal area, with some required steps along the way
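To make the input-output format of ARC AGI 1 more concrete, here is a minimal sketch of a puzzle in that style. The grids, the two candidate rules, and every function name are my own made-up illustrations, not an actual ARC task or any official tooling. The idea is simply to keep the rules that explain all the example pairs and apply the surviving rule to a new input.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]

def flip_horizontal(grid: Grid) -> Grid:
    """Candidate rule: mirror every row left to right."""
    return [list(reversed(row)) for row in grid]

def transpose(grid: Grid) -> Grid:
    """Candidate rule: swap rows and columns."""
    return [list(column) for column in zip(*grid)]

def fits_all_pairs(rule: Callable[[Grid], Grid],
                   pairs: List[Tuple[Grid, Grid]]) -> bool:
    """A rule is only accepted if it reproduces every example output."""
    return all(rule(inp) == out for inp, out in pairs)

# Made-up example pairs (input grid, output grid); numbers stand for colors.
train_pairs: List[Tuple[Grid, Grid]] = [
    ([[1, 0, 0], [2, 0, 0]], [[0, 0, 1], [0, 0, 2]]),
    ([[0, 3], [4, 0]], [[3, 0], [0, 4]]),
]

candidates: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "transpose": transpose,
}

# Keep only the rules that are consistent with all examples.
matching = [name for name, rule in candidates.items()
            if fits_all_pairs(rule, train_pairs)]

# Apply the surviving rule to an unseen test input.
test_input: Grid = [[5, 0, 0, 0]]
prediction = candidates[matching[0]](test_input)
print(matching, prediction)
```

Real ARC tasks are of course far harder, because the space of possible rules is not a short hand-written list.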

I think it’s cool to test out these puzzle games and complete them myself. You can then watch LLMs initially struggle with a benchmark and improve as better models are released. OpenAI’s models, for example, scored the following on ARC AGI 1:

7.8% with o1 mini
75% with o3-low
88% with o3-high

As you can also see in the image below:

This figure shows the performance of different OpenAI models on the ARC AGI 1 benchmark. You can see how performance increases with more advanced models. Image from ARC AGI, which is under the Apache 2 license.

Playing the ARC AGI benchmark

You can also try the ARC AGI benchmarks yourself or build an AI to perform the tasks. Go to the ARC AGI 3 website and start playing the game.

The whole point of the games is that you have no instructions, and you have to figure out the rules yourself. I enjoy this concept, as it represents figuring out an entirely new problem, without any help. This highlights your ability to learn new environments, adapt to them, and solve problems.

I recorded myself playing ARC AGI 3 and encountering the problems for the first time, though I was unfortunately unable to embed the recording in this article. It was super interesting to test out the benchmark and imagine the challenge an LLM has to go through to solve it. I first observe the environment and what happens when I perform the different actions. An action in this case is pressing one of the relevant buttons. Some actions do nothing, while others affect the environment. I then work out the goal of the puzzle (for example, getting the object to the goal area) and try to achieve it.

Why frontier models achieve 0%

The ARC Prize announcement states that frontier models achieved 0% when tested on the ARC AGI 3 preview. This might sound disappointing to some people, considering you were probably able to complete a lot of the tasks yourself, relatively quickly.

As I previously discussed, several OpenAI models have had success with the earlier ARC AGI benchmarks, with their best model achieving 88% on the first version. However, models initially scored 0%, or in the low single digits, on those benchmarks as well.

I have a few theories for why frontier models were not able to perform tasks on ARC AGI 3:

Context length

When working on ARC AGI 3, you do not get any information about the game. The model thus has to try out a variety of actions and see the outcome of each one (for example, nothing happens, or a block moves). The model then has to evaluate the actions it took, along with their outcomes, and consider its next moves.

I believe the action space on ARC AGI 3 is very large, and it’s thus difficult for the models to both experiment enough to find the correct action and avoid repeating unsuccessful actions. The models essentially struggle to make effective use of their full context length: the growing history of actions and outcomes has to fit in the context, and the model has to keep using all of it.
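To illustrate what avoiding unsuccessful actions can look like, here is a minimal sketch of an explorer that remembers which actions had no effect in which state. The `ToyCorridor` game is a hypothetical stand-in that I made up for this example; it is not an ARC AGI 3 environment and does not use the ARC agents API. The explorer also assumes it can observe the full state after each action, which is a simplification.

```python
from collections import defaultdict

class ToyCorridor:
    """Hypothetical game: a block in a 1-D corridor, goal at the far right.

    The player is not told what the buttons do: "A" does nothing,
    "B" moves the block left, and "C" moves it right.
    """
    ACTIONS = ("A", "B", "C")

    def __init__(self, length: int = 5):
        self.length = length
        self.pos = 0

    def step(self, action: str) -> int:
        if action == "B":
            self.pos = max(self.pos - 1, 0)
        elif action == "C":
            self.pos = min(self.pos + 1, self.length - 1)
        return self.pos

    @property
    def solved(self) -> bool:
        return self.pos == self.length - 1

def explore(game: ToyCorridor, max_steps: int = 100) -> dict:
    no_effect = set()         # (state, action) pairs known to change nothing
    tries = defaultdict(int)  # how often each (state, action) was tried
    wasted = 0
    state = game.pos
    for step_count in range(1, max_steps + 1):
        # Never repeat an action already known to do nothing in this state.
        candidates = [a for a in game.ACTIONS if (state, a) not in no_effect]
        # Among the rest, prefer the action tried least often in this state.
        action = min(candidates, key=lambda a: tries[(state, a)])
        tries[(state, action)] += 1
        new_state = game.step(action)
        if new_state == state:
            no_effect.add((state, action))
            wasted += 1
        state = new_state
        if game.solved:
            return {"solved": True, "steps": step_count,
                    "wasted": wasted, "no_effect": no_effect}
    return {"solved": False, "steps": max_steps,
            "wasted": wasted, "no_effect": no_effect}

result = explore(ToyCorridor(length=5))
print(result["solved"], result["steps"], result["wasted"])
```

Because every ineffective action is recorded the first time it is tried, each one is wasted at most once. An LLM agent has to achieve something similar with nothing but its context window, which is where long histories become a problem.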

I recently read an interesting article from Manus about how they develop their agents and manage their memory. You can use techniques such as summarizing previous context or using a file system to store important context. I believe this will be key to increasing performance on the ARC AGI 3 benchmark.
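Here is a minimal sketch of those two ideas combined, under my own assumptions: the full history is appended to a file, while the context handed to the model contains only a summary of older steps plus the most recent ones. The `AgentMemory` class and its methods are hypothetical names I made up, and the "summary" is just a counter of (action, outcome) pairs standing in for an LLM-written summary. This is not how Manus or any particular agent framework implements it.

```python
import json
import tempfile
from collections import Counter, deque
from pathlib import Path

class AgentMemory:
    """Keeps a short recent window in context and the full log on disk."""

    def __init__(self, log_path: Path, max_recent: int = 3):
        self.log_path = log_path
        self.max_recent = max_recent
        self.recent = deque()     # steps shown verbatim in the context
        self.summary = Counter()  # compacted older steps

    def record(self, action: str, outcome: str) -> None:
        event = {"action": action, "outcome": outcome}
        # The file system holds the complete, uncompressed history.
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        self.recent.append(event)
        # Fold the oldest steps into the summary once the window is full.
        while len(self.recent) > self.max_recent:
            old = self.recent.popleft()
            self.summary[(old["action"], old["outcome"])] += 1

    def build_context(self) -> str:
        lines = ["Summary of older steps:"]
        for (action, outcome), count in sorted(self.summary.items()):
            lines.append(f"- {action} -> {outcome} (x{count})")
        lines.append("Most recent steps:")
        for event in self.recent:
            lines.append(f"- {event['action']} -> {event['outcome']}")
        return "\n".join(lines)

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "episode.jsonl"
    memory = AgentMemory(log_file, max_recent=2)
    for action, outcome in [("A", "nothing"), ("A", "nothing"),
                            ("C", "block moved"), ("B", "nothing")]:
        memory.record(action, outcome)
    context = memory.build_context()
    logged_steps = len(log_file.read_text(encoding="utf-8").splitlines())

print(context)
```

The context stays roughly constant in size no matter how long the episode runs, while nothing is lost, since the agent could still reread the log file if it needs a detail the summary dropped.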

Training dataset

Another primary reason frontier models are unable to complete ARC AGI 3 tasks successfully is that the tasks are very different from their training dataset. LLMs will almost always perform way better on a task if such a task (or a similar one) is included in the training dataset. In this instance, I believe LLMs have little training data on working with games, for example. Furthermore, an important point here is also the agentic training data for the LLMs.

By agentic training data, I mean data where the LLM is utilizing tools and performing actions. I believe we are seeing a rapid increase in LLMs used as agents, and thus the proportion of training data covering agentic behavior is rapidly increasing. However, current frontier models may still not be very good at performing such actions, though I expect their agentic ability to improve rapidly in the coming months.

Some people will highlight how this proves LLMs do not have real intelligence: The whole point of intelligence (and the ARC AGI benchmark) is to be able to understand tasks without any clues, only by examining the environment. To some extent, I agree with this point, and I hope to see models perform better on ARC AGI because of increased model intelligence, and not because of benchmark chasing, a concept I explore later in this article.

Benchmark performance in the future

In the future, I believe we will see vast improvements in model performance on ARC AGI 3, mostly because I think you can create AI agents that are fine-tuned for agentic performance and that make optimal use of their memory. I believe relatively cheap improvements can vastly improve performance, though I also expect more expensive improvements (for example, the release of GPT-5) to perform well on this benchmark.

Benchmark chasing

I think it’s important to include a section about benchmark chasing. Benchmark chasing is the concept of LLM providers chasing optimal scores on benchmarks, rather than simply creating the best or most intelligent LLMs. This is a problem because benchmark performance is only an imperfect proxy for LLM intelligence: a model can score higher on a benchmark without being more capable in general.

In the reinforcement learning world, benchmark chasing would be referred to as reward hacking: a scenario where the agent figures out a way to exploit the environment it is in to collect a reward, without properly performing the task.

The reason LLM providers do this is that whenever a new model is released, people usually look at two things:

Benchmark performance
Vibe

Benchmark performance is usually measured on known benchmarks, such as SWE-bench and ARC AGI. The public also often judges LLMs by vibe testing (I’m not saying it’s a good way of testing a model, I’m simply saying it happens in practice). The problem, however, is that I believe it’s quite simple to impress people with the vibe of a model, because vibe checking covers only a very small fraction of what the LLM can be asked to do. You may only be asking it questions whose answers are available on the web, or asking it to program an application of which the model has already seen 1000 instances in its training data.

Thus, what you should do is build a benchmark of your own, for example, an in-house dataset that has not been leaked to the internet. Then you can measure which LLM works best for your use case and prioritize using that LLM.
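A minimal sketch of such an in-house benchmark could look like the following. Everything here is an assumption for illustration: the dataset entries are made up, `model_a` and `model_b` are placeholder functions standing in for real LLM API calls, and exact-match scoring is only one of many possible metrics.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)

def normalize(text: str) -> str:
    """Ignore case and extra whitespace when comparing answers."""
    return " ".join(text.strip().lower().split())

def accuracy(model: Callable[[str], str], dataset: List[Example]) -> float:
    """Fraction of prompts where the model's answer matches exactly."""
    correct = sum(normalize(model(prompt)) == normalize(expected)
                  for prompt, expected in dataset)
    return correct / len(dataset)

def rank_models(models: Dict[str, Callable[[str], str]],
                dataset: List[Example]) -> List[Tuple[str, float]]:
    """Score every model on the same private dataset, best first."""
    scores = {name: accuracy(model, dataset) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up private examples; in practice these come from your own use case.
private_dataset: List[Example] = [
    ("Refund window in days?", "30"),
    ("Support channel for the Basic plan?", "Email only"),
    ("Currency used on invoices?", "EUR"),
]

def model_a(prompt: str) -> str:
    # Placeholder for a real LLM call; answers two of three correctly.
    answers = {
        "Refund window in days?": "30",
        "Support channel for the Basic plan?": "email only",
    }
    return answers.get(prompt, "")

def model_b(prompt: str) -> str:
    # Placeholder for a second LLM; always gives the same answer.
    return "30"

ranking = rank_models({"model_a": model_a, "model_b": model_b},
                      private_dataset)
print(ranking)
```

The important part is not the scoring code but the dataset: as long as it stays private, no provider can have tuned their model on it.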

Conclusion

In this article, I have discussed LLM benchmarks and why they are important for comparing LLMs. I have introduced you to the newly released ARC AGI 3 benchmark. This benchmark is super interesting considering humans are easily able to complete some of the tasks, while frontier models score 0%. This thus represents a task where human intelligence still outperforms LLMs.

As we advance, I believe we will see rapid improvements in LLM performance on ARC AGI 3, though I hope this will not be the result of benchmark chasing, but rather the intelligence improvement of LLMs.
