Ai News

How AI companies are secretly collecting training data from the web (and why it matters)

by AiNEWS2025
2025-06-30
in AI News

Image credit: Getty/the_burtons

Like most people, my wife types a search into Google many times each day. We work from home, so our family room doubles as a conference room. Whenever we’re in a meeting, and a question about anything comes up, she Googles it.

This is the same as it’s been for years. But what happens next has changed.

Instead of clicking on one of the search result links, she more often than not reads the AI summary. These days, she rarely clicks on any of the sites that provide the original information that Google’s AI summarizes.

When I spoke to her about this, Denise acknowledged that she actually visits sites less frequently. But she also pointed out that, for topics where she’s well-versed, she has noticed the AI is sometimes wrong. She said she takes the AI results with a grain of salt, but they often provide basic enough information that she needs to look no further. If in doubt, she does dig deeper.

So that’s where we are today. More and more users are like my wife, getting data from the AI and never visiting websites (and therefore never giving content creators a chance to be compensated for their work).

Worse, more and more people are trusting AI. Not only does that make it harder for content creators to make a living, but those readers are often getting hallucinated or incorrect information. Since they never visit the original sources, they have little impetus to cross-check or verify what they read.

The impact of AI scraping

Cloudflare CEO Matthew Prince offered some devastating statistics. His metric is the ratio of pages crawled to visitors sent back to the site that published them.

As a baseline, he said that 10 years ago, for every two pages Google crawled, it sent one visitor to a content creator’s site. Six months ago, that ratio was six pages crawled to one visitor sent to a content site. Now, just six months later, it’s 18 pages crawled to one visitor sent to a content site.

The numbers, according to Prince, are far worse for AI sites. AI sites derive substantial value from information they’ve scraped from all the rest of us. Six months ago, the ratio of pages scraped to visitors referred by OpenAI was 250 to 1. Now, as people have grown more comfortable trusting AI answers (or too lazy to care about inaccuracies), the ratio is 1,500 to 1.
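
To put those ratios side by side, here is a small arithmetic sketch. The figures are the ones attributed to Prince above; the dictionary names and the `worsening` helper are made up for illustration.

```python
# Pages crawled per one visitor sent back to a publisher,
# using the figures attributed to Matthew Prince in this article.
google = {"10 years ago": 2, "6 months ago": 6, "now": 18}
openai = {"6 months ago": 250, "now": 1500}


def worsening(before: float, after: float) -> float:
    """How many times more pages are crawled per referred visitor."""
    return after / before


google_decade = worsening(google["10 years ago"], google["now"])
google_half_year = worsening(google["6 months ago"], google["now"])
openai_half_year = worsening(openai["6 months ago"], openai["now"])

print(f"Google, over 10 years: {google_decade:.0f}x worse")
print(f"Google, over 6 months: {google_half_year:.0f}x worse")
print(f"OpenAI, over 6 months: {openai_half_year:.0f}x worse")
```

By these numbers, Google’s ratio worsened ninefold over the decade, with a threefold jump in the last six months alone, while OpenAI’s worsened sixfold in the same six months.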

In many ways, AI is becoming an existential threat to content creators. As AI companies vacuum up content produced by hard-working teams all across the world and feed it back to readers as summaries, publishers and writers lose revenue and influence. Many creators are also losing motivation, because if they can’t make a living doing it, or at least create a following, why bother?

Some publishers, like Ziff Davis (ZDNET’s parent company) and the New York Times, are suing OpenAI for copyright infringement. You’ve probably seen the disclaimer on ZDNET that says, “Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.”

Other publishers, including the Wall Street Journal, the Financial Times, the Atlantic, and the Washington Post, have licensed their content to OpenAI and some other makers of large language models.

The damage to society as a whole that AI intermediation can cause is profound and worth an article all on its own. But this article is more practical. Here, we acknowledge the threat AI presents to publishing, and focus on technical ways to fight back.

In other words, if the AIs can’t scrape, they can’t give away published and copyrighted content without publishers’ permission.

Robots.txt: Your first defense

The simplest, most direct, and possibly least effective defense is the robots.txt file. This is a file you put at the root of your website’s directory. It tells spiders, crawlers, and bots whether they have permission to access your site. Because its rules are keyed to each crawler’s User-Agent name, this approach is also called User-Agent filtering.

This file has a number of interesting implications. First, only well-behaved crawlers will pay attention to its specifications. It doesn’t provide any security against access, so compliance is completely voluntary on the part of the bots.

Second, you need to be careful which bots you send away. For example, if you use robots.txt to deny access to Googlebot, your site won’t get indexed for searching on Google. Say goodbye to all Google referrals. On the other hand, if you use robots.txt to deny access to Google-Extended, you’ll tell Google not to use your site’s content for Gemini training.

This site has an index of those bots you might want to deny access to. This is OpenAI’s guide on how to prevent OpenAI’s bots from crawling your site.
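
As a rough illustration of the trade-off described above, the sketch below embeds a sample robots.txt and checks it with Python’s standard `urllib.robotparser`. The rules are an assumption about what a site might choose, not a recommendation: `GPTBot` is the User-Agent token OpenAI documents for its crawler, `Google-Extended` is the token mentioned above, and `example.com` and the `allowed` helper are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt: turn away two AI-related tokens, allow everyone else.
# Compliance is voluntary; this only states the site's wishes.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str = "https://example.com/article") -> bool:
    """Would a well-behaved crawler with this User-Agent fetch the URL?"""
    return parser.can_fetch(agent, url)


for agent in ("Googlebot", "GPTBot", "Google-Extended"):
    print(agent, "allowed" if allowed(agent) else "disallowed")
```

Note that Googlebot stays allowed here, so the site keeps its place in Google search while opting out of the two AI-related tokens.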

But what about web scrapers that ignore robots.txt? How do you prevent them from scraping your site?

How can you prevent rogue scraping?

It’s here that site operators need to use a belts-and-suspenders strategy. You’re basically in an arms race to find a way to defend against scraping, while the scrapers are trying to find a way to suck down all your site’s data. In this section, I’ll list a few techniques. This is far from a complete list. Techniques change constantly, both on the part of the defenders and the scrapers.

Rate limit requests: Modify your server to limit how many pages can be requested by a given IP address in a period of time. Humans aren’t likely to request hundreds of pages per minute. This, like most of the techniques itemized in this section, will differ from server to server, so you’ll have to look up your server to find out how to configure this capability. It may also annoy your site’s visitors so much that they stop visiting. So, there’s that.

Use CAPTCHAs: Keep in mind that CAPTCHAs tend to inconvenience users, but they can reduce some types of crawler access to your site. Of course, the irony is that if you’re trying to block AI crawlers, it’s the AIs that are most likely to be able to defeat the CAPTCHAs. So there’s that.

Selective IP bans: If you find there are IP ranges that overwhelm your site with access requests, you can ban them at the firewall level. FireHOL (an open source firewall toolset) maintains a blacklist of IP addresses. Most of them are cybersecurity-related, but they can get you started on a block list. Be careful, though. Don’t use blanket IP bans, or legitimate visitors will be blocked from your site. So, there’s that, too.
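
Firewall-level bans are normally configured in the firewall itself, but the matching logic is simple enough to sketch with Python’s standard `ipaddress` module. The ranges below are reserved documentation addresses standing in for a real block list, and `is_blocked` is a hypothetical helper.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not a real block list.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.128/25", "2001:db8::/32")
]


def is_blocked(ip: str) -> bool:
    """True if the address falls inside any banned range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in BLOCKED_NETWORKS)


for ip in ("203.0.113.7", "198.51.100.5", "198.51.100.200", "2001:db8::1"):
    print(ip, "blocked" if is_blocked(ip) else "ok")
```

Keeping the ranges narrow, as with the /25 above, is one way to avoid the blanket bans that lock out legitimate visitors.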

The rise of anti-scraping services

There are a growing number of anti-scraping services that will attempt to defend your site for a fee. They include:

  • QRATOR: Network-layer filtering and DDoS-aware bot blocking
  • Cloudflare: Reputation-tracking, fingerprinting, and behavioral analysis
  • Akamai Bot Manager: Identity, intent, and behavioral modeling
  • DataDome: Machine learning plus real-time response
  • HUMAN Security: JavaScript sensors with AI backend
  • Kasada: Adaptive challenges and so-called tamper-proof JavaScript telemetry
  • Imperva: Threat intelligence plus browser fingerprinting
  • Fastly: Rule-based filtering with edge logic
  • Fingerprint: Cross-session fingerprinting and user tracking
  • Link11: Behavioral analysis and traffic sandboxing
  • Netacea: Intent-based detection and server-side analytics

Here’s a quick overview of some of the techniques these services use.

Behavior matching: This technique analyzes more than headers; it analyzes request behavior. It’s essentially a combination of header analysis and bot-by-bot request limiting.

JavaScript challenges: Beyond JavaScript-based CAPTCHA, these often run in the background of a web page. They require scripts to execute or measure the pacing of interaction on the page to allow further access.

Honeypot traps: These are often elements buried in a web page, like invisible fields or links, that are designed to capture bots. If a bot grabs everything on a site (which a human user is unlikely to do), the honeypot trap recognizes it and initiates a server block.
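
The commercial services don’t publish their internals, so the following is only a sketch of the honeypot idea under simple assumptions: a couple of made-up trap URLs that are linked invisibly in pages, and a guard that blocks any IP address requesting one.

```python
# Hypothetical trap URLs: linked from pages in a way humans never see or click.
HONEYPOT_PATHS = {"/internal/do-not-follow", "/archive/hidden-index"}


class HoneypotGuard:
    """Block any client that requests a trap URL."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.blocked_ips = set()

    def check(self, ip: str, path: str) -> bool:
        """Return True if the request should be served."""
        if ip in self.blocked_ips:
            return False
        if path in self.trap_paths:
            self.blocked_ips.add(ip)  # the client followed a hidden link
            return False
        return True


guard = HoneypotGuard(HONEYPOT_PATHS)
print(guard.check("203.0.113.7", "/news/today"))              # normal page
print(guard.check("203.0.113.7", "/internal/do-not-follow"))  # trips the trap
print(guard.check("203.0.113.7", "/news/today"))              # now blocked
```

Listing the trap paths as disallowed in robots.txt is a common companion step, so that well-behaved crawlers never trip them.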

Overall behavioral analysis: This is where AIs are fighting AIs. AIs running on behalf of your website monitor access behavior, and use machine learning to identify access patterns that are not human. Those malicious accesses can then be blocked.

Browser fingerprinting: Browsers provide a wide range of data about themselves to the sites they access. Bots generally attempt to spoof the fingerprints of legitimate users. But they often inadvertently provide their own fingerprints, which blocking services can aggregate and then use to block the bots.
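
Real fingerprinting services draw on far richer signals than this, and their methods are proprietary. As a deliberately simplified sketch, the code below hashes a few request headers into a fingerprint and flags any fingerprint that shows up from many different IP addresses. The header list, function names, and threshold are assumptions for illustration.

```python
import hashlib

# A small, assumed set of headers; real services use many more signals.
FINGERPRINT_HEADERS = ("user-agent", "accept", "accept-language", "accept-encoding")


def fingerprint(headers: dict) -> str:
    """Hash selected request headers into a short, stable identifier."""
    normalized = {key.lower(): value.strip() for key, value in headers.items()}
    material = "\n".join(
        f"{name}={normalized.get(name, '')}" for name in FINGERPRINT_HEADERS
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


def suspicious_fingerprints(requests, threshold: int) -> set:
    """Fingerprints seen from at least `threshold` distinct IP addresses.

    `requests` is an iterable of (ip, headers) pairs.
    """
    ips_by_fingerprint = {}
    for ip, headers in requests:
        ips_by_fingerprint.setdefault(fingerprint(headers), set()).add(ip)
    return {fp for fp, ips in ips_by_fingerprint.items() if len(ips) >= threshold}


bot = {"User-Agent": "ExampleScraper/1.0", "Accept": "*/*"}
human = {"User-Agent": "Mozilla/5.0", "Accept": "text/html", "Accept-Language": "en-US"}
log = [("203.0.113.1", bot), ("203.0.113.2", bot), ("203.0.113.3", bot),
       ("198.51.100.9", human)]
print(suspicious_fingerprints(log, threshold=3))
```

The same fingerprint arriving from many addresses is only a hint, since plenty of people share a common browser configuration.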

Decoy traps: These are mazes of decoy pages filled with autogenerated and useless content, linked together in a pattern that causes bots to waste their time or get stuck following links. Most of those links are tagged “nofollow,” so search engines don’t index the decoy pages and your SEO rank isn’t hurt. Of course, malicious bots are learning how to identify these traps and counter them, but they do offer limited protection.
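
To make the maze idea concrete, here is a minimal sketch in which every decoy page links to a few more decoy pages, with URLs derived deterministically from the current path so nothing needs to be stored. The `/maze/` prefix, the fan-out of three, and both function names are assumptions for illustration, not how any particular service does it.

```python
import hashlib


def decoy_links(path: str, fanout: int = 3) -> list:
    """Derive further decoy URLs from the current path, deterministically."""
    links = []
    for i in range(fanout):
        digest = hashlib.sha256(f"{path}|{i}".encode("utf-8")).hexdigest()[:10]
        links.append(f"/maze/{digest}")
    return links


def render_decoy(path: str) -> str:
    """Build a throwaway page whose nofollow links lead deeper into the maze."""
    anchors = "\n".join(
        f'<a href="{href}" rel="nofollow">more</a>' for href in decoy_links(path)
    )
    return f"<html><body>\n{anchors}\n</body></html>"


print(render_decoy("/maze/start"))
```

A bot that ignores “nofollow” and keeps following these links never runs out of pages to fetch.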

The big trade-off of blocking scraping for AI training

As an author who makes my living directly from my creative output, I find the prospect of AIs using my work as training data to be offensive. How dare a company like OpenAI make billions off the backs of all of us creatives! They then turn around and provide a product that could potentially put many of us out of work.

And yet, I have to acknowledge that AI has saved me time in many different ways. I use a text editor or a word processor every day. But back when I started my career, the publications I wrote for had typesetting operators who converted my written words into publishable content. Now, the blogging tools and content management systems do that work. An entire profession vanished in the space of a few years. Such is the price of new technology.

I’ve been involved with AI innovation for decades. Having written about generative AI since it boomed in early 2023, I’m convinced it’s here to stay.

AI chatbots like Google Gemini and ChatGPT are making token efforts to be good citizens. They scrape all our content and make billions off of it, but they’re willing to provide links back to our work for the very few who bother to check sources.

Some of the big AI companies contend that they provide value back to publishers. An OpenAI spokesperson told Columbia Journalism Review, “We support publishers and creators by helping 400M weekly ChatGPT users discover quality content through summaries, quotes, clear links, and attribution.”

Quoted in Digiday, David Carr, senior insights manager at data analytics company Similarweb, said, “ChatGPT sent 243.8 million visits to 250 news and media websites in April 2025, up 98% from 123.2 million visits this January.” 
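
A quick arithmetic check of the Similarweb figures quoted above, using only the numbers in that quote. The per-site average is a derived illustration, not something Similarweb reported, and the variable names are mine.

```python
# Visits ChatGPT sent to 250 news and media sites, in millions (from the quote).
january_visits = 123.2
april_visits = 243.8
sites = 250

increase_pct = (april_visits - january_visits) / january_visits * 100
avg_per_site_april = april_visits * 1_000_000 / sites

print(f"Increase from January to April: {increase_pct:.1f}%")
print(f"Average April visits per site: {avg_per_site_april:,.0f}")
```

The increase works out to just under 98%, matching the quoted figure, and the April total averages out to a little under a million visits per site for the month.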

Those numbers are big, but only without context. Google gets billions of visits a day, and before AI, nearly all those visits resulted in referrals out to other sites. With Google’s referral percentages dropping precipitously and OpenAI’s referral numbers being a very small percentage of traffic otherwise sent to content producers, the problem is very real.

Yes, those links are mere table scraps, but do we block them? If you enable web scraping blocks on your website, will it do anything other than “cut off your nose to spite your face,” as my mother used to say?

Unless every site blocks AI scrapers, effectively locking AI data sets to 2025 and earlier, blocking your own site from the AIs will accomplish little more than preventing you from getting what little traffic there is from the AI services. So should you?

In the long term, this practice of AI scraping is unsustainable. If AIs prevent creatives from deriving value from their hard work, the creatives won’t have an incentive to keep creating. At that point, the quality of the AI-generated content will begin to decline. It will become a vicious circle, with fewer creatives able to monetize their skills and the AIs providing ever-worsening content quality.

So, what do we do about it? If we are to survive into the future, our entire industry needs to ask and attempt to answer that question. If not, welcome to Idiocracy.

What about you? Have you taken any steps to block AI bots from scraping your site? Are you concerned about how your content might be used to train generative models? Do you think the trade-off between visibility and protection is worth it? What kinds of tools or services, if any, are you using to monitor or limit scraping? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.


