News Group Says A.I. Chatbots Heavily Rely on News Content

News publishers have argued for the past year that A.I. chatbots like ChatGPT rely on copyrighted articles to power the technology. Now the publishers say developers of these tools disproportionately use news content.

The News Media Alliance, a trade group that represents more than 2,200 publishers, including The New York Times, released research on Tuesday that it said showed that developers weight articles more heavily than generic online content when training the technology, and that chatbots reproduce sections of some articles in their responses.

The group argued that the findings show that the A.I. companies violate copyright law.

“It’s an exacerbation of an existing problem,” said Danielle Coffey, the president and chief executive of the News Media Alliance, which has argued for years that tech companies like Google do not fairly compensate news organizations for displaying their work on online services.

Representatives for Google and OpenAI, the maker of ChatGPT, did not immediately respond to requests for comment.

Generative artificial intelligence, the technology behind chatbots, exploded into the mainstream late last year with the release of ChatGPT, a chatbot that can answer questions or complete tasks using information digested from the internet and elsewhere. Other tech companies have released their own versions since.

It is impossible to know exactly what data is fed into the large language models because many developers have not publicly confirmed what they use. In its analysis, the News Media Alliance compared public data sets believed to be used to train the most well-known large language models, which underpin A.I. chatbots like ChatGPT, with an open-source data set of generic content scraped from the web.

The group found that the curated data sets used news content five to 100 times more than the generic data set. Ms. Coffey said those results showed that the people building the A.I. models valued quality content.

The report also found instances of the models directly reproducing language used in news articles, which Ms. Coffey said showed that copies of publishers’ content were retained for use by chatbots. She said that the output from the chatbots then competes with news articles.

“It genuinely acts as a substitution for our very work,” Ms. Coffey said, adding: “You can see our articles are just taken and regurgitated verbatim.”

The News Media Alliance has submitted the findings of the report to the U.S. Copyright Office’s study of A.I. and copyright law.

“It demonstrates that we would have a very good case in court,” Ms. Coffey said.

Ms. Coffey added that the News Media Alliance was actively exploring the collective licensing of content from its members, which include some of the biggest news and magazine publishers in the country.

Media executives have raised a number of concerns about A.I. in addition to the use of articles to train language models. Traffic to news sites from search engines could dwindle, some executives fear, if chatbots become a primary search tool. In addition, many media workers are worried that they could be replaced by A.I.