No, Apple Intelligence wasn’t trained on transcribed YouTube videos and subtitles

2024 July 19
by RSS Feed

Speculation that Apple scraped YouTube videos and subtitles to train its artificial intelligence (AI) models is false. Here’s why.

Apple has addressed concerns raised by Wired that companies like Apple and Nvidia have harvested thousands of YouTube videos without permission for AI training. The iPhone maker confirmed to 9to5Mac that it hasn’t used YouTube content to train models powering its upcoming generative AI features.

In April, Apple open-sourced its on-device large language models, called Open-source Efficient Language Models (OpenELM), on the Hugging Face Hub, a community where developers share their AI code, and on its Machine Learning Research blog.

Apple Intelligence wasn’t trained on YouTube data

And now, the company tells 9to5Mac that it’s only used OpenELM for research purposes and has not incorporated these models into Apple Intelligence.

“Apple says that it created the OpenELM model as a way of contributing to the research community and advancing open source large language model development,” the publication wrote.

Apple’s comment is a reaction to a recent article in Wired. That publication conducted an investigation, finding that “subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple and Salesforce.”

The YouTube Subtitles dataset includes transcripts of videos from popular educational channels; news outlets like the BBC and NPR; YouTube personalities like MKBHD, MrBeast and PewDiePie; and television shows like Last Week Tonight With John Oliver, Jimmy Kimmel Live and The Late Show With Stephen Colbert.

So, what exactly is going on here?

Long story short, the companies in question did not scrape this content themselves. Instead, they used the YouTube Subtitles dataset, which is part of the Pile, a data compilation published by the nonprofit EleutherAI.

While Apple only used this dataset during OpenELM development, that still doesn’t change the fact that EleutherAI used YouTube content without permission.

It also doesn’t mean Apple scraped online data without permission. As the company previously made clear, Apple Intelligence models were trained on licensed data.

This includes data “selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot,” the company clarified, adding that publishers can easily opt out of this by adding the appropriate directives to their site’s robots.txt file, as explained on Apple’s support page.
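For publishers curious how such an opt-out works in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `is_allowed` helper, and the example URL are hypothetical and exist only for illustration. The `Applebot` user-agent token is an assumption based on the crawler name mentioned above, so check Apple's support page for the exact token and directives it recommends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the crawler identified as "Applebot"
# (token assumed for illustration) and allows every other crawler.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/some-post"  # placeholder URL
    print("Applebot allowed:", is_allowed(EXAMPLE_ROBOTS_TXT, "Applebot", url))
    print("Other bot allowed:", is_allowed(EXAMPLE_ROBOTS_TXT, "SomeOtherBot", url))
```

Keep in mind that robots.txt is a request rather than an enforcement mechanism: it only works when the crawler chooses to honor it.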

Source link: https://www.idownloadblog.com/2024/07/19/apple-intelligence-youtube-subtitles-clarification/
