The promised artificial intelligence revolution requires data. Lots and lots of data. OpenAI and Google have started using YouTube videos to train their text-based AI models. But what does the YouTube archive actually include?
Our team of digital media researchers at the University of Massachusetts Amherst collected and analyzed random samples of YouTube videos to learn more about that archive. We published an 85-page paper about that dataset and set up a website called TubeStats for researchers and journalists who need basic information about YouTube.
Now, we’re taking a closer look at some of our surprising findings to better understand how these obscure videos can become part of powerful AI systems. We found that many YouTube videos were created for individual use or for small groups, and a significant number were created by children under the age of 13.
The bulk of the YouTube iceberg
Most people’s experience of YouTube is algorithmically curated: Up to 70% of the videos users watch are recommended by the site’s algorithms. Recommended videos are typically popular content such as influencer stunts, news clips, explainer videos, travel vlogs and video game reviews, while non-recommended content remains obscure.
Some YouTube content imitates popular creators or fits into established categories, but most is unique: family celebrations, selfies set to music, homework assignments, out-of-context video game clips and children dancing. The obscure side of YouTube – the vast majority of the estimated 14.8 billion videos created and uploaded to the platform – is poorly understood.
Shedding light on this aspect of YouTube, and of social media in general, has been difficult because the big tech companies have become increasingly hostile to researchers.
We find that many videos on YouTube are never widely shared. We’ve documented thousands of small individual videos with low views but high engagement – likes and comments – indicating a small but highly engaged audience. These were clearly intended for small audiences of friends and family. Such social uses of YouTube contrast with videos trying to grow their audience, suggesting another way of using YouTube: a video-centric social network for small groups.
Other videos seem intended for a different kind of small, fixed audience: classes recorded during pandemic-era virtual instruction, school board meetings and work meetings. Although most people would not think of these as social uses, they also imply that their creators have a different expectation about the audience for the videos than creators of the kind of content people see in their recommendations.
Fuel for the AI engine
It is with this broader understanding that we read The New York Times exposé on how OpenAI and Google turned to YouTube in a race to find new data to train their large language models. An archive of YouTube transcripts forms an unusual dataset for text-based models.
There is also speculation, fueled in part by an evasive response from Mira Murati, chief technology officer at OpenAI, that the videos themselves could be used to train AI text-to-video models like OpenAI’s Sora.
The New York Times story raised concerns about YouTube’s terms of service and about the copyright issues that permeate much of the debate about AI. But there’s another problem: How does anyone know what’s actually in an archive of more than 14 billion videos uploaded by people around the world? It’s not entirely clear that Google knows, or even could know.
Children as content creators
We were surprised by the unsettling number of videos featuring or created by children. YouTube requires uploaders to be at least 13 years old, but we often saw children who looked much younger than that, usually dancing, singing or playing video games.
In our preliminary research, our coders determined that about one-fifth of random videos showed the face of at least one person who appeared to be under the age of 13. We did not count videos that were clearly recorded with the consent of a parent or guardian.
Our current sample size of 250 is relatively small, and we’re working on coding a much larger sample, but the findings so far are consistent with what we’ve seen in the past. We don’t mean to bash Google. Online age verification is notoriously difficult and fraught, and we have no way of determining whether these videos were uploaded with the consent of a parent or guardian. But we want to underline what the AI models of these big companies are consuming.
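To give a rough sense of what "relatively small" means for a sample of 250, the sketch below computes a 95% Wilson confidence interval for a proportion of one-fifth. It is an illustration only, not the research team's method: the count of 50 out of 250 simply treats all 250 sampled videos as the denominator, which is an assumption, and the 2,500-video sample is a purely hypothetical size used to show how the interval narrows as the sample grows.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: one-fifth of a 250-video sample, i.e. 50 videos.
low, high = wilson_interval(50, 250)
print(f"n=250:  {low:.3f} to {high:.3f}")

# Hypothetical larger sample with the same one-fifth share.
low_big, high_big = wilson_interval(500, 2500)
print(f"n=2500: {low_big:.3f} to {high_big:.3f}")
```

Under these assumptions, the interval for 250 videos spans roughly ten percentage points, which is why a larger coded sample would make the estimate considerably more precise.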
Small reach, big influence
It’s tempting to assume that OpenAI uses highly produced influencer videos or TV newscasts published on the platform to train its models, but previous research on large language model training data shows that the most popular content is not always the most influential for training AI models. A virtually unseen conversation between three friends may have more linguistic value for training a chatbot language model than a music video with millions of views.
Unfortunately, OpenAI and other AI companies are very opaque about their training materials: They don’t specify what is included and what isn’t. Most of the time, researchers can infer problems with training data only through biases in the output of AI systems. But when researchers do get a look at the training data, there is often cause for concern. For example, Human Rights Watch reported on June 10, 2024, that a popular training dataset contains many photos of identifiable children.
The history of big tech self-regulation is filled with moving goal posts. OpenAI in particular is notorious for asking forgiveness rather than permission and has faced mounting criticism for putting profit over safety.
Concerns about using user-generated content to train AI models generally center on intellectual property, but there are also privacy issues. YouTube is a vast, unregulated archive that cannot be fully reviewed.
Models trained on a subset of professionally produced videos could be an AI company’s first training material. But without strong policies, any company that ingests more than the popular tip of the iceberg may be taking in content that violates the Federal Trade Commission’s regulations under the Children’s Online Privacy Protection Act, which prevent companies from collecting data from children under 13 without prior notice.
With last year’s executive order on AI and at least one promising proposal on the table for comprehensive privacy legislation, there are signs that legal protections for user data in the United States may become much stronger.
Have you unknowingly helped train ChatGPT?
A YouTube uploader’s intentions may not be as consistent or predictable as those of someone publishing a book, writing an article for a magazine, or displaying a painting in a gallery. But even if YouTube’s algorithm ignores your upload and it never gets more than a couple of views, it can be used to train models like ChatGPT and Gemini.
As far as AI is concerned, your family reunion video may be just as important as a video uploaded by the influencer MrBeast or by CNN.