Apple may have trained its AI models on thousands of YouTube videos
Apple, Anthropic and other major artificial intelligence (AI) companies have reportedly trained AI models on data from hundreds of thousands of YouTube videos. A new report claims that multiple AI companies used a publicly available dataset called the Pile, which contained the plain text of video captions but no video footage. The captions were taken from popular YouTube creators like MrBeast, Marques Brownlee and PewDiePie, as well as Indian YouTube creators like CarryMinati, BB ki Vines and Ashish Chanchlani.
Multiple AI models reportedly trained on YouTube videos
An investigation by Proof News found that the dataset contained subtitle data from 173,536 YouTube videos drawn from more than 48,000 channels. According to the report, EleutherAI, a non-profit AI research lab, compiled this dataset. It was later used by companies including Apple, Anthropic, Nvidia, and Salesforce. The AI lab also published a research paper detailing the dataset.
EleutherAI created an 800GB data repository called the Pile and made it publicly available for those who wanted to train AI models but couldn’t afford large datasets. The majority of the dataset came from publicly available sources like the English Wikipedia, e-books, and more. However, it also included subtitles from YouTube videos, which were compiled into a component dataset called YouTube Subtitles.
The report alleged that the Pile was used to train Apple’s OpenELM AI model, based on the description in the model’s research paper. Research papers for AI models from Salesforce, Nvidia, and Anthropic also reportedly mention use of the dataset.
Anthropic spokesperson Jennifer Martinez told the publication in a statement: “The Pile contains a very small subset of YouTube captions. YouTube’s terms of service cover direct use of the platform, which is different from use of the Pile dataset. Regarding potential violations of YouTube’s terms of service, we must refer you to the Pile authors.”
Notably, YouTube’s terms of service prohibit accessing videos on the platform through automated means such as robots, botnets, or scrapers. Collecting YouTube captions in this way falls under scraping. A Google spokesperson told Proof News in an emailed response that the tech giant “has taken steps over the years to prevent unauthorized scraping abuse.” However, the spokesperson did not comment on the use of the data by AI companies.
In a post on X (formerly known as Twitter), Marques Brownlee criticized Apple for obtaining data from companies whose datasets contained the transcripts of his videos. However, he also stressed that it was not technically the iPhone maker’s fault, as it did not collect the data itself.
Apple has obtained data for their AI from several companies
One of them scraped tons of data/transcripts from YouTube videos, including mine
Apple technically avoids “fault” here because they’re not the ones scraping
But this will remain an evolving problem for a long time to come https://t.co/U93riaeSlY
Marques Brownlee (@MKBHD), July 16, 2024
While this dataset was collected and distributed publicly, there may be other instances of data scraping on platforms like YouTube. As AI companies scramble to find more data to train their large language models (LLMs), data procurement may continue to enter similar legal gray areas.