New data shows that AI companies love ‘Premium Publisher’ content
OpenAI, Google, Meta and Anthropic all rely heavily on content from premium publishers to train the large language models, or LLMs, at the heart of their AI efforts, even though these companies have regularly downplayed their use of such copyrighted content, according to new research released this week by online publishing giant Ziff Davis.
Ziff Davis owns CNET, as well as numerous other brands including IGN, PCMag, Mashable and Everyday Health.
An article describing the research, written by Ziff Davis Chief AI Attorney George Wukoson and Chief Technology Officer Joey Fortuna, reports that AI companies have deliberately filtered out low-quality content in favor of high-quality, human-made content to train their models. Since AI companies want their models to perform well, it makes sense that they prefer quality content in their training data. To make this distinction, AI companies used websites' domain authority, essentially their ranking in Google search results. In general, sources that rank higher on Google tend to be higher quality and more reliable.
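For illustration only, here is a minimal sketch of the kind of domain-authority filtering the research describes. The domains, scores and threshold below are invented for the example; the Ziff Davis paper does not publish the actual pipelines or cutoffs AI companies use.

```python
# Hypothetical quality filter: keep training documents only if their source
# domain clears an assumed "domain authority"-style threshold.
from urllib.parse import urlparse

# Illustrative scores standing in for search-ranking authority (0-100).
DOMAIN_AUTHORITY = {
    "nytimes.com": 95,
    "pcmag.com": 82,
    "example-content-farm.net": 12,
}

AUTHORITY_THRESHOLD = 50  # assumed cutoff for "premium" sources


def is_premium(url: str) -> bool:
    """Return True if the URL's domain clears the assumed authority cutoff."""
    domain = urlparse(url).netloc.removeprefix("www.")
    return DOMAIN_AUTHORITY.get(domain, 0) >= AUTHORITY_THRESHOLD


documents = [
    {"url": "https://www.nytimes.com/2024/01/01/tech/ai.html", "text": "..."},
    {"url": "https://example-content-farm.net/10-weird-tricks", "text": "..."},
]

# Only the high-authority source survives the filter.
training_set = [doc for doc in documents if is_premium(doc["url"])]
print(len(training_set))  # 1
```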
The companies behind popular AI chatbots like ChatGPT and Gemini have been secretive about where they get the information that powers the answers the bots give you. That opacity leaves consumers with no insight into the sources, their reliability, or whether the training data may be biased or perpetuate harmful stereotypes.
But it’s also a point of major contention with publishers, who say AI companies are essentially illegally copying the copyrighted work they own, without permission or compensation. While OpenAI has licensed content from some publishers as it transforms from a nonprofit to a for-profit company, other media companies are suing ChatGPT’s creator for copyright infringement.
“Large LLM developers no longer make their training data public as they used to. They are now more commercial and less transparent,” Wukoson and Fortuna wrote.
OpenAI, Google, Meta and Anthropic did not immediately respond to requests for comment.
Publishers including The New York Times have sued Microsoft and OpenAI for copyright infringement, while Dow Jones, publisher of The Wall Street Journal, and the New York Post have filed a lawsuit against Perplexity, another generative AI startup, on similar grounds.
Big Tech has seen massive gains during the AI boom. Google is currently valued at approximately $2.2 trillion and Meta at approximately $1.5 trillion, partly because of their work on generative AI. Investors currently value the startups OpenAI and Anthropic at $157 billion and $40 billion, respectively. News publishers, by contrast, have had a harder time and have gone through waves of layoffs in recent years. They are struggling to find an audience in a highly competitive online media environment, navigating the noise of online search, AI-generated “slop” and social media.
Meta CEO Mark Zuckerberg said in an interview with The Verge earlier this year that creators and publishers “overestimate the value of their specific content.”
Meanwhile, some AI companies have entered into licensing agreements with publishers to provide their LLMs with timely news articles. OpenAI signed deals with the Financial Times, Dotdash Meredith, Vox and others earlier this year. Meta and Microsoft have also signed agreements with publishers. Ziff Davis has not signed a similar deal.
Based on an analysis of AI companies’ disclosures about their older models, Wukoson and Fortuna found that URLs from top publishers such as Axel Springer (Business Insider, Politico), Future PLC (TechRadar, Tom’s Guide), Hearst (San Francisco Chronicle, Men’s Health), News Corp (The Wall Street Journal), The New York Times Company, The Washington Post and others made up 12.04% of the training data, at least for the OpenWebText2 dataset. OpenWebText2 was used to train GPT-3, the underlying technology for ChatGPT, although the latest version of ChatGPT is no longer built directly on GPT-3.
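As a rough sketch of how a publisher-share figure like that 12.04% could be computed from a dataset’s URL list, consider the example below. The domain list and toy URLs are illustrative; Ziff Davis has not published its exact methodology.

```python
# Estimate what fraction of a dataset's URLs point at "premium" publisher
# domains. The domain set here is an assumed, partial stand-in for the
# publishers named in the research.
from urllib.parse import urlparse

PREMIUM_DOMAINS = {
    "businessinsider.com", "politico.com",        # Axel Springer
    "techradar.com", "tomsguide.com",             # Future PLC
    "sfchronicle.com", "menshealth.com",          # Hearst
    "wsj.com", "nytimes.com", "washingtonpost.com",
}


def premium_share(urls: list[str]) -> float:
    """Fraction of URLs whose domain belongs to a premium publisher."""
    def domain(url: str) -> str:
        return urlparse(url).netloc.removeprefix("www.")
    hits = sum(1 for url in urls if domain(url) in PREMIUM_DOMAINS)
    return hits / len(urls) if urls else 0.0


# Toy URL list standing in for a dataset index like OpenWebText2's.
sample = [
    "https://www.nytimes.com/2020/07/01/science/example.html",
    "https://www.reddit.com/r/news/comments/abc123",
    "https://www.tomsguide.com/reviews/example",
    "https://someblog.example.org/post",
]
print(f"{premium_share(sample):.2%}")  # 50.00% for this toy sample
```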
OpenAI, Google, Anthropic and Meta have not publicly released the training data used to train their most recent models.
Each of the trends discussed in the research paper “reflects LLM firms’ decisions to prioritize high-quality web text datasets when training LLMs, resulting in revolutionary technology breakthroughs that deliver tremendous value to those firms,” Wukoson and Fortuna wrote.