• About
  • Advertise
  • Privacy & Policy
  • Contact
Thursday, January 1, 2026
  • Login
  • Home
    • Home – Layout 1
    • Home – Layout 2
    • Home – Layout 3
    • Home – Layout 4
    • Home – Layout 5
    • Home – Layout 6
  • News
    • All
    • Business
    • Politics
    • Science
    • World
    Hillary Clinton in white pantsuit for Trump inauguration

    Hillary Clinton in white pantsuit for Trump inauguration

    Amazon has 143 billion reasons to keep adding more perks to Prime

    Amazon has 143 billion reasons to keep adding more perks to Prime

    Shooting More than 40 Years of New York’s Halloween Parade

    Shooting More than 40 Years of New York’s Halloween Parade

    These Are the 5 Big Tech Stories to Watch in 2017

    These Are the 5 Big Tech Stories to Watch in 2017

    Why Millennials Need to Save Twice as Much as Boomers Did

    Why Millennials Need to Save Twice as Much as Boomers Did

    Doctors take inspiration from online dating to build organ transplant AI

    Doctors take inspiration from online dating to build organ transplant AI

    Trending Tags

    • Trump Inauguration
    • United Stated
    • White House
    • Market Stories
    • Election Results
  • Tech
    • All
    • Apps
    • Gadget
    • Mobile
    • Startup
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    Shadow Tactics: Blades of the Shogun Review

    Shadow Tactics: Blades of the Shogun Review

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    The Last Guardian Playstation 4 Game review

    The Last Guardian Playstation 4 Game review

    These Are the 5 Big Tech Stories to Watch in 2017

    These Are the 5 Big Tech Stories to Watch in 2017

    Trending Tags

    • Nintendo Switch
    • CES 2017
    • Playstation 4 Pro
    • Mark Zuckerberg
  • Entertainment
    • All
    • Gaming
    • Movie
    • Music
    • Sports
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Harnessing the power of VR with Power Rangers and Snapdragon 835

    Harnessing the power of VR with Power Rangers and Snapdragon 835

    So you want to be a startup investor? Here are things you should know

    So you want to be a startup investor? Here are things you should know

  • Lifestyle
    • All
    • Fashion
    • Food
    • Health
    • Travel
    Shooting More than 40 Years of New York’s Halloween Parade

    Shooting More than 40 Years of New York’s Halloween Parade

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Why Millennials Need to Save Twice as Much as Boomers Did

    Why Millennials Need to Save Twice as Much as Boomers Did

    Doctors take inspiration from online dating to build organ transplant AI

    Doctors take inspiration from online dating to build organ transplant AI

    How couples can solve lighting disagreements for good

    How couples can solve lighting disagreements for good

    Ducati launch: Lorenzo and Dovizioso’s Desmosedici

    Ducati launch: Lorenzo and Dovizioso’s Desmosedici

    Trending Tags

    • Golden Globes
    • Game of Thrones
    • MotoGP 2017
    • eSports
    • Fashion Week
  • Review
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    Shadow Tactics: Blades of the Shogun Review

    Shadow Tactics: Blades of the Shogun Review

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    The Last Guardian Playstation 4 Game review

    The Last Guardian Playstation 4 Game review

    Intel Core i7-7700K ‘Kaby Lake’ review

    Intel Core i7-7700K ‘Kaby Lake’ review

No Result
View All Result
Ai News
Advertisement
  • Home
    • Home – Layout 1
    • Home – Layout 2
    • Home – Layout 3
    • Home – Layout 4
    • Home – Layout 5
    • Home – Layout 6
  • News
    • All
    • Business
    • Politics
    • Science
    • World
    Hillary Clinton in white pantsuit for Trump inauguration

    Hillary Clinton in white pantsuit for Trump inauguration

    Amazon has 143 billion reasons to keep adding more perks to Prime

    Amazon has 143 billion reasons to keep adding more perks to Prime

    Shooting More than 40 Years of New York’s Halloween Parade

    Shooting More than 40 Years of New York’s Halloween Parade

    These Are the 5 Big Tech Stories to Watch in 2017

    These Are the 5 Big Tech Stories to Watch in 2017

    Why Millennials Need to Save Twice as Much as Boomers Did

    Why Millennials Need to Save Twice as Much as Boomers Did

    Doctors take inspiration from online dating to build organ transplant AI

    Doctors take inspiration from online dating to build organ transplant AI

    Trending Tags

    • Trump Inauguration
    • United Stated
    • White House
    • Market Stories
    • Election Results
  • Tech
    • All
    • Apps
    • Gadget
    • Mobile
    • Startup
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    Shadow Tactics: Blades of the Shogun Review

    Shadow Tactics: Blades of the Shogun Review

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    The Last Guardian Playstation 4 Game review

    The Last Guardian Playstation 4 Game review

    These Are the 5 Big Tech Stories to Watch in 2017

    These Are the 5 Big Tech Stories to Watch in 2017

    Trending Tags

    • Nintendo Switch
    • CES 2017
    • Playstation 4 Pro
    • Mark Zuckerberg
  • Entertainment
    • All
    • Gaming
    • Movie
    • Music
    • Sports
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Harnessing the power of VR with Power Rangers and Snapdragon 835

    Harnessing the power of VR with Power Rangers and Snapdragon 835

    So you want to be a startup investor? Here are things you should know

    So you want to be a startup investor? Here are things you should know

  • Lifestyle
    • All
    • Fashion
    • Food
    • Health
    • Travel
    Shooting More than 40 Years of New York’s Halloween Parade

    Shooting More than 40 Years of New York’s Halloween Parade

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Heroes of the Storm Global Championship 2017 starts tomorrow, here’s what you need to know

    Why Millennials Need to Save Twice as Much as Boomers Did

    Why Millennials Need to Save Twice as Much as Boomers Did

    Doctors take inspiration from online dating to build organ transplant AI

    Doctors take inspiration from online dating to build organ transplant AI

    How couples can solve lighting disagreements for good

    How couples can solve lighting disagreements for good

    Ducati launch: Lorenzo and Dovizioso’s Desmosedici

    Ducati launch: Lorenzo and Dovizioso’s Desmosedici

    Trending Tags

    • Golden Globes
    • Game of Thrones
    • MotoGP 2017
    • eSports
    • Fashion Week
  • Review
    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    The Legend of Zelda: Breath of the Wild gameplay on the Nintendo Switch

    Shadow Tactics: Blades of the Shogun Review

    Shadow Tactics: Blades of the Shogun Review

    macOS Sierra review: Mac users get a modest update this year

    macOS Sierra review: Mac users get a modest update this year

    Hands on: Samsung Galaxy A5 2017 review

    Hands on: Samsung Galaxy A5 2017 review

    The Last Guardian Playstation 4 Game review

    The Last Guardian Playstation 4 Game review

    Intel Core i7-7700K ‘Kaby Lake’ review

    Intel Core i7-7700K ‘Kaby Lake’ review

No Result
View All Result
Ai News
No Result
View All Result
Home Deep Learning

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

AiNEWS2025 by AiNEWS2025
2024-12-22
in Deep Learning
0
FACTS Grounding: A new benchmark for evaluating the factuality of large language models
0
SHARES
0
VIEWS
Share on FacebookShare on Twitter


Responsibility & Safety

Published
17 December 2024
Authors

FACTS team

Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations

Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can “hallucinate” false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their applications in the real world.

Today, we’re introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.

We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we’re also launching the FACTS leaderboard on Kaggle. We’ve already tested leading LLMs using FACTS Grounding and have populated the initial leaderboard with their grounding scores. We will maintain and update the leaderboard as the field advances.

Current leaderboard ranking

FACTS Grounding dataset

To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset comprises 1,719 examples, each carefully crafted to require long-form responses grounded in the context document provided. Each example comprises a document, a system instruction requiring the LLM to exclusively reference the provided document, and an accompanying user request.

An example from the FACTS Grounding dataset

All examples are divided into a “public” set (860) and a “private” (859) held out set. We are releasing the public set today so anyone can use it to evaluate an LLM. Of course, we know that issues of benchmark contamination and leaderboard hacking are important to protect against, so following standard industry practice, we are keeping the private evaluation set held out. The FACTS leaderboard scores are the average performance across both public and private sets.

To ensure a diversity of inputs, the FACTS Grounding examples include documents with a variety of lengths, up to a maximum of 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide ranging, including requests for summarization, Q&A generation, and rewriting tasks. We did not include any examples that could require creativity, mathematics, or complex reasoning – capabilities which might require the model to apply more advanced reasoning in addition to grounding.

Prompt distribution

Collective judgement by leading LLMs

To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that is both a comprehensive answer to the user request and fully attributable to that document.

FACTS Grounding evaluates model responses automatically using three frontier LLM judges — namely Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We selected a combination of different judges to mitigate any potential bias of a judge giving higher scores to the responses produced by a member of its own model family. The automatic judge models were comprehensively evaluated against a held-out test set to find the best performing judging prompt templates and to verify agreement with human raters.

Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility, and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged as factually accurate if they are fully grounded in information contained in the provided document, with no hallucinations.

With the eligibility and grounding accuracy of a given LLM response evaluated separately by multiple AI judge models, the results are then aggregated to determine if the LLM has dealt with the example successfully. The final score for the overall grounding task is the average of all judge models’ scores across all examples. Find more details of our FACTS Grounding evaluation methodology in our paper.

A factually correct response that fails to properly address the user’s request fails the benchmarking example. Here we see three instances of model responses that the automated LLM judges considered ineligible

FACTS Grounding will continue to evolve

We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we aim to grow and iterate FACTS Grounding as the field progresses, continually raising the bar.

We encourage the AI community to engage with FACTS Grounding, evaluate their models on the open set of examples or to submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continuous research and development will continue to improve AI systems.

Acknowledgements

FACTS is a collaboration between Google DeepMind and Google Research.
FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating.

We are also very grateful for contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, and Sasha Goldshtein.

We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.

Source link

#FACTS #Grounding #benchmark #evaluating #factuality #large #language #models

Previous Post

2024 Was Slow For Tech IPOs, But 2025 Could Be Different

Next Post

If your AI-generated code becomes faulty, who faces the most liability exposure?

AiNEWS2025

AiNEWS2025

Next Post
If your AI-generated code becomes faulty, who faces the most liability exposure?

If your AI-generated code becomes faulty, who faces the most liability exposure?

Stay Connected test

  • 23.9k Followers
  • 99 Subscribers
  • Trending
  • Comments
  • Latest
A tiny new open source AI model performs as well as powerful big ones

A tiny new open source AI model performs as well as powerful big ones

0
Water Cooler Small Talk: The Birthday Paradox 🎂🎉 | by Maria Mouschoutzi, PhD | Sep, 2024

Water Cooler Small Talk: The Birthday Paradox 🎂🎉 | by Maria Mouschoutzi, PhD | Sep, 2024

0
Ghost of Yōtei: The acclaimed Ghost of Tsushima is getting a sequel

Ghost of Yōtei: The acclaimed Ghost of Tsushima is getting a sequel

0
Best Headphones for Working Out (2024): Bose, Shokz, JLab

Best Headphones for Working Out (2024): Bose, Shokz, JLab

0
MIT engineers design an aerial microrobot that can fly as fast as a bumblebee

MIT engineers design an aerial microrobot that can fly as fast as a bumblebee

2025-12-31
Production-Ready LLMs Made Simple with the NeMo Agent Toolkit

Production-Ready LLMs Made Simple with the NeMo Agent Toolkit

2025-12-31
Here we go again: Retiring coal plant forced to stay open by Trump Admin

Here we go again: Retiring coal plant forced to stay open by Trump Admin

2025-12-31
Why inventing new emotions feels so good

Why inventing new emotions feels so good

2025-12-31

Recent News

MIT engineers design an aerial microrobot that can fly as fast as a bumblebee

MIT engineers design an aerial microrobot that can fly as fast as a bumblebee

2025-12-31
Production-Ready LLMs Made Simple with the NeMo Agent Toolkit

Production-Ready LLMs Made Simple with the NeMo Agent Toolkit

2025-12-31
Here we go again: Retiring coal plant forced to stay open by Trump Admin

Here we go again: Retiring coal plant forced to stay open by Trump Admin

2025-12-31
Why inventing new emotions feels so good

Why inventing new emotions feels so good

2025-12-31
Footer logo

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow Us

Browse by Category

  • AI & Cloud Computing
  • AI & Cybersecurity
  • AI & Sentiment Analysis
  • AI Applications
  • AI Ethics
  • AI Future Predictions
  • AI in Education
  • AI in Fintech
  • AI in Gaming
  • AI in Healthcare
  • AI in Startups
  • AI Innovations
  • AI News
  • AI Research
  • AI Tools & Automation
  • Apps
  • AR/VR & AI
  • Business
  • Deep Learning
  • Emerging Technologies
  • Entertainment
  • Fashion
  • Food
  • Gadget
  • Gaming
  • Health
  • Lifestyle
  • Machine Learning
  • Mobile
  • Movie
  • Music
  • News
  • Politics
  • Review
  • Robotics & Smart Systems
  • Science
  • Sports
  • Startup
  • Tech
  • Travel
  • World

Recent News

MIT engineers design an aerial microrobot that can fly as fast as a bumblebee

MIT engineers design an aerial microrobot that can fly as fast as a bumblebee

2025-12-31
Production-Ready LLMs Made Simple with the NeMo Agent Toolkit

Production-Ready LLMs Made Simple with the NeMo Agent Toolkit

2025-12-31
  • About
  • Advertise
  • Privacy & Policy
  • Contact

© 2026 JNews - Premium WordPress news & magazine theme by Jegtheme.

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In
No Result
View All Result

© 2026 JNews - Premium WordPress news & magazine theme by Jegtheme.