Overview
- An investigation by The Atlantic reports that because Common Crawl’s scraper does not execute client-side JavaScript, paywall scripts never trigger, allowing full-text capture of millions of paywalled news articles.
- Common Crawl published a denial stating its CCBot only collects publicly accessible pages, does not log in to sites, and does not bypass access restrictions.
- The reporting says AI companies including OpenAI, Google, Anthropic, Meta, Amazon, and Nvidia have used Common Crawl’s archive to train large language models.
- Publishers have attempted to block the crawler and have requested removals, but the investigation says those takedown requests have gone unfulfilled and the archives appear unchanged since 2016; the foundation attributes this to the immutability of its file formats.
- The Atlantic alleges the foundation misled publishers and obscured what its archive contains, and quotes executive director Rich Skrenta minimizing the importance of individual outlets and saying, “The robots are people too.”