
How to Benchmark LLMs – ARC AGI 3

by AiNEWS2025
2025-08-01
in Machine Learning


In the last few weeks, we have seen the release of powerful LLMs such as Qwen 3 MoE, Kimi K2, and Grok 4. We will continue to see such rapid improvements for the foreseeable future, and to compare the LLMs against each other, we need benchmarks. In this article, I discuss the newly released ARC AGI 3 benchmark and why frontier LLMs struggle to complete any tasks on it.

Motivation

Today, we’re announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI

We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API

Starting scores – Frontier AI: 0%, Humans: 100% pic.twitter.com/3YY6jV2RdY

— ARC Prize (@arcprize) July 18, 2025

ARC AGI 3 was recently released.

My motivation for writing this article is to stay on top of the latest developments in LLM technology. In the last couple of weeks alone, we have seen the Kimi K2 model (the best open-source model when released), Qwen 3 235B-A22B (currently the best open-source model), Grok 4, and more. There is so much happening in the LLM space, and one way to keep up is to track the benchmarks.

I think the ARC AGI benchmark is particularly interesting, mainly because I want to see if LLMs can match human-level intelligence. ARC AGI puzzles are designed so that humans can complete them while LLMs struggle.

You can also read my article on Utilizing Context Engineering to Significantly Enhance LLM Performance and check out my website, which contains all my information and articles.


Introduction to ARC AGI

ARC AGI is essentially a puzzle game of pattern matching.

  • ARC AGI 1: You are given a series of input-output pairs, and have to complete the pattern
  • ARC AGI 2: Similar to the first benchmark, performing pattern matching on input and output examples
  • ARC AGI 3: Here you are playing a game, where you have to move your block into the goal area, completing some required steps along the way
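
To make the pattern-matching idea behind the first two benchmarks concrete, here is a minimal sketch in Python. The grids and the three candidate rules are toy examples I made up for illustration; they are not taken from the actual ARC AGI tasks, which are far more varied. The sketch simply checks which candidate rule explains every training pair and then applies it to a new input.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Toy training pairs (hypothetical, not real ARC AGI data):
# each output is the input mirrored left-to-right.
TRAIN_PAIRS: List[Tuple[Grid, Grid]] = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]


def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    return [row[:] for row in grid[::-1]]


# A deliberately tiny hypothesis space; real tasks need a much richer one.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
}


def find_rule(pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the name of the first candidate rule consistent with all pairs."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in pairs):
            return name
    return None


rule = find_rule(TRAIN_PAIRS)
test_input = [[5, 0, 0], [0, 0, 6]]
prediction = CANDIDATES[rule](test_input) if rule else None
print(rule, prediction)
```

The hard part of the real benchmark is that the space of possible rules is not given in advance, which is exactly what this toy version glosses over.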

I think it’s cool to test out these puzzle games and complete them myself. You can then see how LLMs initially struggle with a benchmark and improve as better models are released. OpenAI models, for example, scored the following on ARC AGI 1:

  • 7.8% with o1 mini
  • 75% with o3-low
  • 88% with o3-high

As you can also see in the image below:

This figure shows the performance of different OpenAI models on the ARC AGI 1 benchmark. You can see how performance increases with more advanced models. Image from ARC AGI, which is under the Apache 2 license.

Playing the ARC AGI benchmark

You can also try the ARC AGI benchmarks yourself or build an AI to perform the tasks. Go to the ARC AGI 3 website and start playing the game.

The whole point of the games is that you have no instructions, and you have to figure out the rules yourself. I enjoy this concept, as it represents figuring out an entirely new problem, without any help. This highlights your ability to learn new environments, adapt to them, and solve problems.

You can see a recording of me playing ARC AGI 3 here, encountering the problems for the first time. I was unfortunately unable to embed the link in the article. However, it was super interesting to test out the benchmark and imagine the challenge an LLM has to go through to solve it. I first observe the environment, and what happens when I perform the different actions. An action in this case is pressing one of the relevant buttons. Some actions do nothing, while others affect the environment. I then proceed to uncover the goal of the puzzle (for example, get the object to the goal area) and try to achieve this goal.

Why frontier models achieve 0%

This article states that when frontier models were tested on the ARC AGI 3 preview, they achieved 0%. This might sound disappointing to some people, considering you were probably able to successfully complete a lot of the tasks yourself, relatively quickly.

As I previously discussed, several OpenAI models have had success with the earlier ARC AGI benchmarks, with their best model achieving 88% on the first version. However, models initially scored 0%, or in the low single digits, on those benchmarks as well.

I have a few theories for why frontier models were not able to perform tasks on ARC AGI 3:

Context length

When working on ARC AGI 3, you do not get any information about the game. The model thus has to try out a variety of actions and observe the outcome of each one (for example, nothing happens, or a block moves). The model then has to evaluate the actions it took, along with their outcomes, and decide on its next moves.
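
The try-observe-decide loop described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `ToyGridGame`, its four buttons, and the visible goal position are hypothetical and have nothing to do with the real ARC AGI 3 environments or the agents API. In the actual benchmark, the goal itself also has to be discovered.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


class ToyGridGame:
    """Hypothetical stand-in for an interactive environment (not the ARC AGI 3 API)."""

    # Hidden rules the agent is not told: only two of the four buttons do anything.
    _EFFECTS: Dict[str, Position] = {"A": (1, 0), "B": (0, 1)}

    def __init__(self, goal: Position = (2, 1)) -> None:
        self.pos: Position = (0, 0)
        self.goal = goal  # assumed to be visible, which is a simplification

    def step(self, action: str) -> Position:
        dx, dy = self._EFFECTS.get(action, (0, 0))
        self.pos = (self.pos[0] + dx, self.pos[1] + dy)
        return self.pos

    def solved(self) -> bool:
        return self.pos == self.goal


def probe_actions(env: ToyGridGame, actions: List[str]) -> Dict[str, Position]:
    """Try each action once and record how the observed position changed."""
    effects = {}
    for action in actions:
        before = env.pos
        after = env.step(action)
        effects[action] = (after[0] - before[0], after[1] - before[1])
    return effects


def reach_goal(env: ToyGridGame, effects: Dict[str, Position], max_steps: int = 20) -> List[str]:
    """Greedily pick the learned action that lands closest to the goal."""
    useful = {a: d for a, d in effects.items() if d != (0, 0)}
    steps: List[str] = []
    while useful and not env.solved() and len(steps) < max_steps:
        def distance_after(action: str) -> int:
            dx, dy = useful[action]
            return abs(env.goal[0] - (env.pos[0] + dx)) + abs(env.goal[1] - (env.pos[1] + dy))

        best = min(useful, key=distance_after)
        env.step(best)
        steps.append(best)
    return steps


env = ToyGridGame()
effects = probe_actions(env, ["A", "B", "C", "D"])
steps = reach_goal(env, effects)
print(effects, steps, env.solved())
```

Note that probing is not free: each trial action changes the state of the game, so the agent has to plan from wherever the exploration left it.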

I believe the action space on ARC AGI 3 is very large, which makes it difficult for the models to both experiment enough to find the correct action and avoid repeating unsuccessful actions. In essence, the models struggle to fit the full history of actions and outcomes into their context window and to make effective use of it.

I recently read an interesting article from Manus about how they develop their agents and manage their memory. You can use techniques such as summarizing previous context or using a file system to store important context. I believe this will be key to increasing performance on the ARC AGI 3 benchmark.
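
As a rough illustration of what summarizing previous context could look like, here is a small Python sketch. It is not how Manus or any ARC AGI 3 agent actually manages memory; `CompactingMemory` is a hypothetical class, and the "summary" is just a count of repeated action-result pairs rather than an LLM-written summary. The idea is to keep the most recent steps verbatim, fold older ones into a compact form, and make it cheap to look up actions that are already known to do nothing.

```python
from collections import Counter
from typing import List, Set, Tuple

Step = Tuple[str, str]  # (action, observed result)


class CompactingMemory:
    """Keeps recent steps verbatim and folds older ones into a short summary."""

    def __init__(self, max_recent: int = 3) -> None:
        self.max_recent = max_recent
        self.recent: List[Step] = []
        self.summary: Counter = Counter()  # counts of older (action, result) pairs

    def add(self, action: str, result: str) -> None:
        self.recent.append((action, result))
        # Once the verbatim window is full, compress the oldest step into a count.
        while len(self.recent) > self.max_recent:
            oldest = self.recent.pop(0)
            self.summary[oldest] += 1

    def known_noops(self) -> Set[str]:
        """Actions observed to do nothing, so the agent can avoid repeating them."""
        seen = list(self.summary) + self.recent
        return {action for action, result in seen if result == "nothing happened"}

    def render(self) -> str:
        """Build the text that would be placed in the model's context."""
        lines = [f"{a} -> {r} (seen {n}x)" for (a, r), n in sorted(self.summary.items())]
        lines += [f"{a} -> {r}" for a, r in self.recent]
        return "\n".join(lines)


memory = CompactingMemory(max_recent=2)
history = [
    ("C", "nothing happened"),
    ("C", "nothing happened"),
    ("A", "block moved right"),
    ("D", "nothing happened"),
    ("B", "block moved up"),
]
for action, result in history:
    memory.add(action, result)

prompt_context = memory.render()
print(prompt_context)
print(sorted(memory.known_noops()))
```

Five steps of history shrink to four lines here, and the saving grows as the agent repeats itself. The same interface could be backed by a file on disk instead of in-memory structures.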

Training dataset

Another primary reason frontier models are unable to complete ARC AGI 3 tasks successfully is that the tasks are very different from their training dataset. LLMs will almost always perform much better on a task if that task (or a similar one) is included in the training dataset. In this instance, I believe LLMs have little training data on working with games, for example. Another important factor is how much agentic training data the LLMs have seen.

By agentic training data, I mean data where the LLM uses tools and performs actions. I believe we are seeing a rapid increase in LLMs used as agents, and thus, the proportion of training data covering agentic behavior is rapidly increasing. However, current frontier models may still not be very good at performing such actions, though I expect this to improve rapidly in the coming months.

Some people will highlight how this proves LLMs do not have real intelligence: The whole point of intelligence (and the ARC AGI benchmark) is to be able to understand tasks without any clues, only by examining the environment. To some extent, I agree with this point, and I hope to see models perform better on ARC AGI because of increased model intelligence, and not because of benchmark chasing, a concept I explore later in this article.

Benchmark performance in the future

In the future, I believe we will see vast improvements in model performance on ARC AGI 3, mostly because I think you can create AI agents that are fine-tuned for agentic performance and that make optimal use of their memory. I believe relatively cheap improvements can vastly improve performance, though I also expect more expensive improvements (for example, the release of GPT-5) to perform well on this benchmark.

Benchmark chasing

I think it’s important to include a section on benchmark chasing. Benchmark chasing is the concept of LLM providers chasing optimal scores on benchmarks, rather than simply creating the best or most intelligent LLMs. This is a problem because benchmark performance and LLM intelligence are not perfectly correlated.

In the reinforcement learning world, benchmark chasing would be referred to as reward hacking: a scenario where the agent figures out a way to exploit the environment it is in to achieve a reward, without properly performing the task.

The reason LLM providers do this is that whenever a new model is released, people usually look at two things:

  • Benchmark performance
  • Vibe

Benchmark performance is usually measured on known benchmarks, such as SWE-bench and ARC AGI. Vibe testing is another way the public often judges LLMs (I’m not saying it’s a good way of testing a model, I’m simply saying it happens in practice). The problem, however, is that I believe it’s quite simple to impress people with the vibe of a model, because vibe checking covers only a very small portion of the LLM’s action space. You may only be asking it questions whose answers are available on the web, or asking it to program an application of which the model has already seen 1000 instances in its training data.

Thus, what you should do is maintain a benchmark of your own, for example, an in-house dataset that has not been leaked to the internet. Then you can measure which LLM works best for your use case and prioritize using that LLM.
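
A minimal sketch of such an in-house benchmark in Python could look like the following. The dataset, the two `model_*` functions, and the exact-match scoring are all placeholders I chose for illustration: in practice, each model function would wrap a call to a real LLM, and the scoring would be whatever fits your use case.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical private examples; in practice these come from your own, unpublished data.
PRIVATE_DATASET: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("reverse 'abc'", "cba"),
]


def model_a(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers."""
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "")


def model_b(prompt: str) -> str:
    """Second stand-in model with slightly messy output formatting."""
    answers = {"2 + 2": "4", "capital of France": "paris ", "reverse 'abc'": "cba"}
    return answers.get(prompt, "")


def exact_match_accuracy(model: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    """Fraction of prompts answered correctly, ignoring case and surrounding whitespace."""
    correct = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return correct / len(dataset)


def rank_models(
    models: Dict[str, Callable[[str], str]], dataset: List[Tuple[str, str]]
) -> List[Tuple[str, float]]:
    """Score every model on the same dataset and sort best-first."""
    scores = {name: exact_match_accuracy(fn, dataset) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_models({"model_a": model_a, "model_b": model_b}, PRIVATE_DATASET)
for name, score in ranking:
    print(f"{name}: {score:.0%}")
```

Keeping the dataset fixed and private is the point: every new model is scored on the same examples, and none of them can have seen those examples during training.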

Conclusion

In this article, I have discussed LLM benchmarks and why they are important for comparing LLMs. I have introduced you to the newly released ARC AGI 3 benchmark. This benchmark is super interesting considering humans are easily able to complete some of the tasks, while frontier models score 0%. It therefore represents a task where human intelligence still outperforms LLMs.

As we advance, I believe we will see rapid improvements in LLM performance on ARC AGI 3, though I hope this will not be the result of benchmark chasing, but rather of genuine improvements in LLM intelligence.


