The Great AI Content Heist: How Bots Are Devouring the Internet – and How We Can Fight Back

- AI companies are vacuuming up online content at massive scale: A new investigation found over 15.8 million YouTube videos from 2+ million channels were scraped without permission by tech firms to train AI models theatlantic.com. Major players like Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, and others have downloaded huge datasets of videos and web content – often in violation of platform terms and copyright – to feed their AI systems theatlantic.com. These companies insist they “respect” creators and claim such usage is legal under current law theatlantic.com, even as creators see their work taken without consent.
- Web robots (“bots”) are bombarding sites constantly – sometimes overwhelmingly: Automated crawlers from AI labs now account for a staggering share of Internet traffic. Cloudflare reports about 30% of global web traffic comes from bots (automated scripts), even exceeding human traffic in some regions blog.cloudflare.com. For open-source projects and smaller sites, the onslaught is extreme – up to 97% of traffic on some sites now originates from AI company bots crawling for data tech.slashdot.org. In one case, OpenAI’s “GPTBot” crawler hit a forum with ~18,000 pageviews in 24 hours, far eclipsing Google’s bot activity meta.discourse.org. Some bots hit websites at high frequencies (hundreds of requests per second), straining servers and bandwidth.
- Many AI scrapers ignore the rules and disguise themselves: Robots.txt files (standard “do not crawl” directives) are often disregarded by these AI-driven crawlers dailynous.com. They frequently spoof their identities and cycle through IP addresses to evade detection tech.slashdot.org. One developer noted, “It’s futile to block AI crawler bots because they lie, change their user agent, use residential IPs as proxies, and more.” tech.slashdot.org This stealthy behavior lets them bypass basic defenses and continuously harvest content – even from sites that explicitly opt out.
- Publishers and creators are fighting back with tech and legal tactics: To defend their content, website owners are deploying stricter anti-bot measures – from requiring logins and CAPTCHAs, to custom “proof-of-work” challenges that force bots to solve puzzles tech.slashdot.org tech.slashdot.org. Some communities have even blocked entire countries or IP ranges when overwhelmed by scrapers tech.slashdot.org. Others experiment with “honeypot” traps or tar pits that lure and bog down bots with endless fake content. Creators and media publishers are also pursuing lawsuits and legislation to require consent or compensation for AI training data, arguing that uncontrolled scraping is essentially theft of intellectual property. (Notably, YouTube’s own terms forbid mass downloading, but the platform has done little so far to stop the massive scraping of videos theatlantic.com.)
- Cloudflare is emerging as a key ally by offering anti-scraping solutions: The web infrastructure company Cloudflare, which sits in front of millions of sites, has rolled out tools to mitigate the AI bot deluge. In late 2024 it gave websites a “one-click” option to block known AI crawlers from accessing their content dailynous.com. In 2025, Cloudflare went further, unveiling an “AI Labyrinth” feature that traps unauthorized bots in a maze of junk content tech.slashdot.org. Rather than simply blocking scrapers (which might tip them off), the Labyrinth feeds AI bots realistic-looking but irrelevant pages to waste their time and resources tech.slashdot.org tech.slashdot.org. Cloudflare also announced a pay-per-crawl program to make AI firms pay for each scrape, aiming to compensate creators and curb unlimited scraping blog.alexseifert.com blog.alexseifert.com. Even some AI companies are cooperating with this plan, acknowledging that a sustainable content ecosystem is in their interest blog.alexseifert.com.
AI Companies Are Scraping the Web for Everything – Without Asking
The rise of generative AI has kicked off an arms race for data, as AI companies seek to ingest as much online content as possible to train their models. Text from websites, images, code repositories, music – and now video – are all being vacuumed up. A bombshell report from The Atlantic in September 2025 revealed the sheer scale of this activity on YouTube: more than 15.8 million videos (from over 2 million channels) were quietly scraped and downloaded without permission as training data for AI theatlantic.com. These weren’t obscure clips either – nearly 1 million were how-to videos, and countless others came from popular creators and even major organizations like the BBC and TED theatlantic.com theatlantic.com. In many cases the videos were stripped of titles or creator names in the datasets to obscure their origin theatlantic.com, but investigators traced the data back to real YouTube channels.
Crucially, this mass downloading violates YouTube’s terms of service – yet it has been happening largely unchecked theatlantic.com. AI developers have used third-party tools to rip videos en masse. YouTube appears to have done little, if anything, to stop the mass downloading, according to The Atlantic, and the company declined to comment on the situation theatlantic.com. In other words, even the largest content platforms are struggling (or failing) to prevent their content from being siphoned off into AI training sets.
And it’s not just YouTube. Virtually the entire open web is being crawled for data. Tech giants and startups alike have deployed web crawlers (bots) that scour websites for text, images, and other data to feed into AI models. The companies involved read like a who’s who of Big Tech and AI: Microsoft, Meta (Facebook), Amazon, Nvidia, ByteDance (TikTok’s owner), Snap, Tencent, OpenAI, Anthropic, Runway and more. According to The Atlantic’s research, “Many major tech companies have used these data sets to train AI” – and when asked, most declined to comment or justify it theatlantic.com. Only a few (Meta, Amazon, Nvidia) responded, essentially claiming they “respect content creators” and believe using public data in this way is legal under copyright law theatlantic.com. In practice, these firms are exploiting a gray area: copyright and platform rules weren’t designed for AI training, and until courts or laws catch up, AI companies are racing to gather as much content as possible.
Why are they doing this? AI models like GPT-4, DALL-E, Stable Diffusion, and the next generation of AI video generators need enormous datasets to learn. The more text or video they ingest, the more patterns they can detect, which supposedly makes them smarter or more capable. For AI that produces images or video, for example, developers seek out high-quality visual content – one leaked spreadsheet from a video AI startup (Runway) showed they specifically targeted videos with “beautiful cinematic landscapes” and “high quality scenes”, even labeling one creator’s channel as “THE HOLY GRAIL” of content to copy theatlantic.com. In short, your posts, articles, or videos might be exactly what an AI needs to learn a skill – and so the bots come crawling.
The Bot Onslaught: How Often Web Crawlers Visit (Hint – It’s a LOT)
All this AI data hunger translates into a flood of bot traffic hitting websites around the clock. If you run a website or online service, chances are a significant chunk of your traffic isn’t human readers at all, but automated bots scraping your content. Cloudflare, a major Internet infrastructure firm, estimates that about 30% of all web traffic today comes from bots (good and bad combined) – and in some regions, automated traffic even exceeds human traffic blog.cloudflare.com. These bots include everything from “good” bots like search engine indexers (Googlebot, Bingbot) to “bad” bots like spammers, content scrapers, and hackers. Recently, a new category of bots – AI training crawlers – has exploded in activity, adding to the load blog.cloudflare.com.
In fact, data suggests that AI crawlers have ramped up dramatically in the past year. Cloudflare’s analysis of crawler trends from mid-2024 to mid-2025 found that OpenAI’s GPTBot emerged as a dominant crawler, surging from about 5% to 30% of all AI crawler traffic in just that period blog.cloudflare.com. Meta’s new crawler also shot up to nearly 20% share. Meanwhile other bots like Anthropic’s Claude bot, Amazon’s crawler, and ByteDance’s “Bytespider” were also in the mix blog.cloudflare.com. In short, OpenAI and a few others are now hitting sites even more than many search engines do. For example, anecdotal reports from web admins in 2023–2024 showed that OpenAI’s GPTBot was making tens of thousands of requests to certain sites – in some cases outpacing Google’s web crawler by a wide margin meta.discourse.org. One forum admin observed GPTBot hitting ~18,000 pages in a day, versus ~1,700 by Googlebot meta.discourse.org. Another site admin said if that rate continued, GPTBot alone would clock nearly 3 million page fetches per year from their single site meta.discourse.org.
Small wonder then that some maintainers feel under siege. A comprehensive report from LibreNews (cited by Ars Technica) noted that some open-source project sites saw as much as 97% of their traffic coming from AI company bots rather than humans tech.slashdot.org. In effect, real users were just a tiny blip compared to the constant swarm of scrapers. This deluge isn’t just an abstract inconvenience – it drives up bandwidth costs, overloads servers, and can even knock sites offline. In one case, a developer’s self-hosted code repository (a Gitea server run by Xe Iaso) was hammered so hard by Amazon’s AI crawler that it caused repeated outages tech.slashdot.org. The usual defenses – updating robots.txt to say “go away”, blocking known bot user-agent strings, IP-banning obvious culprits – all failed, because the scrapers evolved tactics to evade them tech.slashdot.org. They rotated through residential IP proxies (making the traffic look like ordinary users), and faked their identities as benign agents tech.slashdot.org. As Iaso put it, “they lie…change their user agent…use residential IPs… and more”, making it nearly impossible to keep them out by traditional means tech.slashdot.org.
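The user-agent check that failed here takes only a few lines to implement, which also makes its weakness obvious. In the sketch below (not the tooling from any of the projects mentioned), the denylist tokens are the publicly documented crawler names; the `is_blocked` helper itself is illustrative:

```python
# Known AI crawler tokens as published by their operators.
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot"]

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known AI crawler.

    The weakness described above is built in: a scraper that spoofs its
    User-Agent (e.g. claims to be a normal browser) sails straight
    through this check, which is why it fails against aggressive bots.
    """
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)
```

A self-identifying crawler like `GPTBot/1.1` is caught; the same bot behind a residential proxy with a browser User-Agent is indistinguishable from a human reader at this layer.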
Another open-source developer likened the effect to a slow-motion DDoS attack: these bots hit sites so frequently and broadly that they consume huge chunks of server capacity, sometimes overwhelming community-run services on limited budgets tech.slashdot.org. Kevin Fenzi, a sysadmin for a Fedora Linux project, reported they ultimately had to block all traffic from certain regions (like Brazil) because they couldn’t stem the bot tide any other way tech.slashdot.org. The GNOME project’s servers implemented a drastic challenge-response filter (a proof-of-work “are you human?” test dubbed Anubis), and when they did so, they found only ~3% of requests were legitimate people – 97% got flagged as bots and failed the test tech.slashdot.org. This gives a sense of just how much crawler noise was hitting them. (However, such extreme measures can inconvenience real users too – some mobile users reportedly had to wait minutes for the puzzle to load tech.slashdot.org.)
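A proof-of-work gate in the spirit of Anubis (this is a from-scratch sketch, not Anubis’s actual scheme) asks each client to find a nonce whose hash clears a difficulty target before the site responds. Verifying the answer costs the server a single hash, while producing it costs the visitor roughly 2^difficulty hashes – negligible for one human page load, expensive at crawl scale:

```python
import hashlib
import itertools

def solve_challenge(challenge: str, difficulty: int = 12) -> int:
    """Client side: brute-force a nonce whose SHA-256 digest over
    'challenge:nonce' falls below the difficulty target. Expected cost
    is about 2**difficulty hashes per page view."""
    target = 1 << (256 - difficulty)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 12) -> bool:
    """Server side: a single hash confirms the work was done."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))
```

In a real deployment the challenge would be a per-session token and the solving loop would run in the visitor’s browser, which is also why some mobile users reportedly waited minutes at high difficulty settings.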
It’s not just text and code sites. Even Wikimedia (Wikipedia’s infrastructure) has reported unusual surges in bandwidth usage in recent years, believed to be from AI scraping. And AI model makers don’t necessarily obey polite conventions like crawl delays (to slow their rate) or limits. A user on one technical forum complained that Anthropic’s AI crawler was particularly aggressive – “the worst offender by far”, seemingly “determined to destroy the internet” with how hard it hit sites dailynous.com. That colorful remark underscores the resentment building in many corners of the web: from the perspective of someone running a site, these bots can feel like invaders chewing up resources without permission.
How Publishers and Creators Are Defending Themselves
Faced with an onslaught of unsanctioned scraping, content publishers and independent creators are experimenting with ways to protect their work. There are both technological defenses and legal/business responses emerging.
On the tech side, many website owners first rely on traditional measures: updating their robots.txt files to disallow AI crawlers, and blocking known bot user-agent names like GPTBot or ClaudeBot. Unfortunately, as noted, the more aggressive scrapers simply ignore robots.txt rules dailynous.com or disguise themselves. That has led to more drastic steps. Some admins use rate-limiting and firewall rules (for example, Cloudflare or server-level rules) to cap how many requests a single IP or client can make. Others deploy CAPTCHAs or login walls to at least make scraping harder. The open-source community, hit hard by bot traffic, has gotten especially creative: the Anubis proof-of-work challenge mentioned earlier is one example tech.slashdot.org. It forces each visitor to expend significant computation (solving a puzzle) before the site will respond – which slows down bots dramatically (and deterred 97% of them in GNOME’s case) tech.slashdot.org. Another experimental tactic is setting up “tar pits” or honeypots: pages that deliberately trap bots in a loop. One such project, Nepenthes, was discussed on tech forums as a way to ensnare AI crawlers in an endless maze of junk data. The idea is to waste the bot’s time and resources without giving it anything useful – a form of revenge or friction-adding. However, such DIY traps carry risks (they might also snare or confuse legitimate crawlers like search engines, potentially causing a site to get de-listed on Google dailynous.com).
Beyond pure tech barriers, publishers are increasingly pushing for compensation or legal recourse. There’s a growing sentiment that if AI companies want to use publishers’ content, they should ask and perhaps pay for it, rather than just scrape it. For instance, some major news organizations have begun striking licensing deals with AI firms (the Associated Press inked a deal with OpenAI in 2023 to license its news stories for training data, rather than have them scraped indiscriminately). Book publishers and media companies are also lobbying for clearer copyright protections. Dozens of lawsuits have been filed by authors, artists, and even source code owners, arguing that wholesale data scraping to train AI violates intellectual property rights. These cases (some still in progress) will test whether AI training falls under “fair use” or not theatlantic.com. Early signs are mixed – some judges have expressed skepticism about the “fair use” defense for AI scraping theatlantic.com, but definitive legal standards are still emerging.
In the meantime, collective action is also on the rise. Content platforms and creators are banding together to demand better protections. For example, during the Hollywood writers’ and actors’ strikes in 2023, one issue on the table was limiting AI’s use of scripts and likenesses without consent nofilmschool.com. Now, with this new evidence of massive video scraping, filmmakers and YouTubers are calling for stronger laws to require consent, transparency, and payment when AI systems mine their work nofilmschool.com nofilmschool.com. As the No Film School article put it, this is becoming an “existential threat to creative professions”, and creators argue it’s not about stopping technology but about demanding consent and compensation for the use of their content nofilmschool.com nofilmschool.com.
For individual site owners, legal battles by giants may not immediately help their bandwidth bills. So many are taking matters into their own hands with the tools available – or even deciding to retreat. Some have made their communities private or login-only, to keep scrapers out. Others have considered blocking all unknown traffic and only allowing known browsers, though that’s difficult. It’s an ongoing cat-and-mouse game: as publishers harden their sites, the scrapers adapt.
Cloudflare’s Tools: Blocking, Trapping, and Now Charging the Bots
Amid this struggle, Cloudflare – which powers traffic for a huge swath of websites (from small blogs to big enterprises) – has stepped up as a key player in mitigating AI scraping. Cloudflare sits between websites and the internet, acting as a combined content delivery network and security shield. This vantage point lets them detect bot patterns at scale and intervene on behalf of site owners.
One of Cloudflare’s first moves came in September 2024, when they introduced a simple “one-click AI bot blocking” setting dailynous.com. This feature, available to any site using Cloudflare, allowed owners to automatically block known AI crawlers (like those identified as GPTBot, etc.) with a toggle. It was essentially an update to Cloudflare’s bot management, recognizing that many customers explicitly wanted to keep AI scrapers out. While helpful, Cloudflare recognized this was only a partial fix – sophisticated bots could spoof their identity to bypass such a block, and outright blocking could also signal to the bot operators that they’d been caught (potentially prompting them to try different tactics).
So in 2025, Cloudflare unveiled a more novel approach: turning the tables on the bots. In March 2025 it announced “AI Labyrinth”, a feature designed not just to block crawlers but to actively stall and mislead them tech.slashdot.org. If Cloudflare’s system detects an unauthorized AI scraper on a protected site, it doesn’t simply return an error. Instead, it serves up a series of AI-generated web pages filled with plausible-sounding but irrelevant content tech.slashdot.org. The idea is to create a “maze” of information that only the bot will follow – wasting its crawling time and compute power. “When we detect unauthorized crawling… we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them,” Cloudflare explained of the Labyrinth tech.slashdot.org. These decoy pages contain neutral, generic text (so as not to accidentally poison the bot with misinformation, but also not to give it anything valuable) and include meta tags telling search engines not to index them tech.slashdot.org. That way, human users or Google’s crawler won’t stumble into the maze – but a rogue AI bot that ignores instructions will gladly wander in and get bogged down. Importantly, Cloudflare noted that just blocking a bot can sometimes alert the scrapers that they’ve been spotted, whereas feeding them an endless maze keeps them occupied without raising alarm tech.slashdot.org. It’s an arms-length form of defense – a clever trick or trap. According to Cloudflare, by early 2025 AI scrapers were generating over 50 billion requests per day on Cloudflare’s network – nearly 1% of all traffic the company handles tech.slashdot.org. Labyrinth aims to make those billions of unwanted bot requests as unproductive as possible for the scrapers.
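Cloudflare has not published the Labyrinth’s internals, but the mechanics it describes – bland generated filler, links leading deeper into the maze, and a noindex meta tag so legitimate search crawlers stay out – can be sketched roughly like this (all names and the URL scheme here are illustrative, not Cloudflare’s):

```python
import random

# Neutral filler vocabulary: plausible-looking but valueless text.
WORDS = ["harvest", "archive", "survey", "catalog", "ledger",
         "registry", "almanac", "gazette", "compendium", "index"]

def decoy_page(depth: int, fanout: int = 3, seed: int = 0) -> str:
    """Render one maze page: generated filler text plus links one level
    deeper, with a noindex/nofollow meta tag so search engines skip the
    maze entirely while a rule-ignoring bot keeps following links."""
    rng = random.Random(depth * 100003 + seed)  # deterministic per page
    filler = " ".join(rng.choice(WORDS) for _ in range(40)).capitalize() + "."
    links = "".join(
        f'<li><a href="/maze/{depth + 1}/{i}">{rng.choice(WORDS)}</a></li>'
        for i in range(fanout)
    )
    return (
        "<html><head>"
        '<meta name="robots" content="noindex, nofollow">'
        f"</head><body><p>{filler}</p><ul>{links}</ul></body></html>"
    )
```

Because every page links to several more, a crawler that follows everything faces exponentially many pages, each cheap for the server to generate and worthless to a training set.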
Beyond blocking and trapping, Cloudflare is also piloting a solution to bridge the gap between AI firms and content creators: making the bots pay up. In mid-2025, Cloudflare announced a “pay-per-crawl” program that would charge AI companies each time their bot scrapes a site, with the fees funnelled to the content publisher blog.alexseifert.com blog.alexseifert.com. Cloudflare’s CEO Matthew Prince said this feature is about ensuring “the Internet as we know it will survive the age of AI”, by funding creators and incentivizing AI companies to collaborate rather than just take blog.alexseifert.com. “Original content is what makes the Internet great… AI crawlers have been scraping content without limits. Our goal is to put the power back in the hands of creators, while still helping AI companies innovate,” Prince explained blog.alexseifert.com. In other words, if an AI wants to crawl your site, it should negotiate and pay for the privilege, creating a more sustainable model. Some publishers expressed optimism that this pay-per-crawl system could stem the endless scraping that feels like theft of their work blog.alexseifert.com. Surprisingly, even a few AI companies are reportedly on board with testing this arrangement blog.alexseifert.com. Cloudflare indicated it has partnered with certain AI firms willing to use a unified interface to compensate content owners and get reliable access blog.alexseifert.com. The logic for the AI side is that a “long-term collaboration” with creators will ensure they have fresh, high-quality data to train on, and avoid the public relations and legal headaches of data theft blog.alexseifert.com. As Cloudflare’s blog noted, without a healthy ecosystem of original content, AI models risk becoming outdated, biased, or untrusted blog.alexseifert.com. In short, Cloudflare is trying to broker a peace: let the bots in, but only through the front door and with a toll – instead of sneaking in the back window.
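Reports on pay-per-crawl describe it as built around the long-dormant HTTP status code 402 (“Payment Required”): an unpaying crawler gets a price quote instead of content. The gate below is a hypothetical illustration of that flow, not Cloudflare’s actual API – the header names and price are invented for the sketch:

```python
from typing import Dict, Optional, Tuple

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "Bytespider")

def handle_crawl(user_agent: str, payment_token: Optional[str],
                 price_usd: float = 0.001) -> Tuple[int, Dict[str, str]]:
    """Hypothetical pay-per-crawl gate: AI crawlers without a payment
    token get 402 plus a quoted price; paying crawlers and ordinary
    visitors get 200."""
    if not any(tok in user_agent for tok in AI_CRAWLERS):
        return 200, {}  # ordinary visitor or search engine: serve freely
    if payment_token is None:
        # "crawler-price" is an invented header name for this sketch
        return 402, {"crawler-price": f"USD {price_usd}"}
    return 200, {"crawler-charged": f"USD {price_usd}"}
```

The design point is that the toll sits at the CDN layer: the site owner sets a price once, and negotiation happens per request rather than per licensing contract.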
All these efforts – blocking, labyrinth trapping, and pay-per-crawl tolls – make Cloudflare a central figure in the fight against unrestricted AI scraping. However, it’s worth noting that these solutions are mostly opt-in (sites have to decide to enable them) and they’re early-stage. The Labyrinth approach is innovative, but time will tell how well it fools the next generation of AI crawlers or whether they find ways out of the maze. The pay-per-crawl model too is in beta, depending on voluntary participation and perhaps the honor of AI companies. Still, it’s a start, and it shows the industry acknowledging the problem.
Balancing Innovation and Fairness
The battle over AI scraping ultimately boils down to a fundamental question: how do we balance the incredible potential of AI with the rights and needs of content creators and publishers? On one hand, training AI on vast swaths of internet data has enabled amazing breakthroughs – from chatbots that can answer complex questions to AI that can generate images and videos on demand. On the other hand, this progress has been built on uncompensated use of human-made content, raising ethical and economic dilemmas. If creators large and small can’t trust that their work (their articles, art, videos, code) won’t just be gobbled up for AI profits, the incentive to create and share openly diminishes. As one filmmaker lamented, “Every frame of your work that [AI companies] ingest is used to build a tool to replace you… they are not augmenting human creativity; they are automating it to cut costs.” nofilmschool.com The fear in creative communities is palpable.
We are now seeing the first attempts to find a new equilibrium. Technological defenses like anti-bot systems and industry initiatives like Cloudflare’s can provide immediate relief and set norms (e.g. that scraping shouldn’t be completely free or unchecked). Meanwhile, legal and legislative efforts are underway that could redefine what AI companies are allowed to do. There are calls for clear regulations that require consent or licensing for using copyrighted data in AI training nofilmschool.com. Publishers are negotiating as a group in some cases, and regulators in Europe and elsewhere are eyeing rules that might force transparency (such as labeling AI outputs and disclosing training data sources).
For now, the scraping frenzy continues, but awareness is at an all-time high. Websites are monitoring their traffic and openly sharing stories of bot barrages. Tools to detect AI crawlers (by their behavior or known IP ranges) are improving. And even AI developers themselves have acknowledged the issue – OpenAI, for instance, published an official GPTBot policy in 2023 with instructions for sites on how to block it (a sign they know not everyone wants to be scraped). Nonetheless, many feel that self-policing by AI firms isn’t enough; concrete external action is needed.
In summary, the Internet is facing a “great AI content heist”, but it’s not an unstoppable crime spree. Site owners and tech providers are arming themselves with better locks and clever traps. Publishers are demanding a seat at the table – and a share of the value – if their content is used for AI. Cloudflare’s CEO put it simply: without support for creators, “the future of a free and vibrant Internet” is at risk blog.alexseifert.com. The coming months and years will determine whether we can rein in the bots and strike a fair bargain, or whether AI’s appetite will fundamentally change the open web as we know it. For the public and policymakers, the message is clear: pay attention to the unseen bots crawling every corner of the web, because they’re not just innocently browsing – they’re building the brains of new AI, and the costs of that feast are being borne by everyone else.
Sources: The Atlantic (AI Watchdog) theatlantic.com theatlantic.com; No Film School nofilmschool.com nofilmschool.com; Cloudflare Blog blog.cloudflare.com tech.slashdot.org; Ars Technica via Slashdot tech.slashdot.org tech.slashdot.org; Daily Nous (comments) dailynous.com dailynous.com; Alex Seifert’s Notebook blog.alexseifert.com blog.alexseifert.com.