Social media platform Reddit said Tuesday it will update the web standard it uses to block automated data collection from its website, after reports that AI startups were circumventing the rule to collect content for their systems.
The move comes as artificial intelligence companies are accused of plagiarizing content from publishers to create AI-generated summaries without attribution or permission.
Reddit said it will update its implementation of the Robots Exclusion Protocol, or "robots.txt", a widely accepted standard that tells automated crawlers which parts of a site they may access.
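The protocol itself is simply a plain-text file served at /robots.txt; a compliant crawler fetches it and checks each URL against its rules before requesting the page. The sketch below, using Python's standard robotparser module, is a rough illustration of that check; the bot name is hypothetical and the result depends on whatever rules Reddit actually publishes at that address.

```python
# Minimal sketch of how a well-behaved crawler honours robots.txt,
# using Python's standard library. The user-agent string is illustrative,
# not a real crawler's identity.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://www.reddit.com/robots.txt")
parser.read()  # fetch and parse the site's current rules

# A compliant crawler asks before fetching each page.
allowed = parser.can_fetch("ExampleBot/1.0", "https://www.reddit.com/r/news/")
print("crawl permitted" if allowed else "crawl disallowed by robots.txt")
```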
The company also said it will enforce rate limiting, a technique that caps the number of requests a single client can make, and will block unknown bots and crawlers that try to scrape the site, that is, collect and store its raw content.
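Rate limiting is commonly implemented with some variant of a token bucket: each client earns request "tokens" at a fixed rate, and requests made faster than that are rejected. The Python sketch below is a generic illustration of the idea, not a description of Reddit's actual system; the limits and client identifiers are made up.

```python
import time

class TokenBucket:
    """Illustrative token-bucket limiter: a client may make roughly `rate`
    requests per second, with short bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the request is refused

# One bucket per client identity (e.g. per IP address or API key).
buckets: dict[str, TokenBucket] = {}

def handle_request(client_id: str) -> int:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=5.0, capacity=10.0))
    return 200 if bucket.allow() else 429  # HTTP 429: Too Many Requests
```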
Recently, robots.txt has become an important tool publishers use to prevent tech companies from using their content for free to train AI algorithms and create summaries in response to certain search queries.
Last week, content licensing startup TollBit wrote a letter to publishers alleging that several AI companies were circumventing web standards to scrape publishers’ sites.
The letter followed an investigation by Wired, which found that AI search startup Perplexity had likely bypassed attempts to block its web crawler via robots.txt.
Earlier in June, business media publisher Forbes accused Perplexity of plagiarizing its investigative stories for use in generative AI systems, without attribution.
Reddit said Tuesday that researchers and organizations such as the Internet Archive will continue to have access to its content for non-commercial use.
© Thomson Reuters 2024