girlfreddy@lemmy.ca to

Technology@beehaw.org · 3 months ago

Leaked: Nvidia's AI Scraping Pipeline

www.404media.co

Internal emails, Slack conversations and documents obtained by 404 Media show how Nvidia created a yet-to-be-released video foundational model.

Nvidia scraped videos from YouTube and several other sources to compile training data for its AI products, internal Slack chats, emails, and documents obtained by 404 Media show.

When asked about legal and ethical aspects of using copyrighted content to train an AI model, Nvidia defended its practice as being “in full compliance with the letter and the spirit of copyright law.” Internal conversations at Nvidia viewed by 404 Media show that when employees working on the project raised questions about potential legal issues surrounding the use of YouTube videos and of datasets compiled by academics for research purposes, managers told them they had clearance to use that content from the highest levels of the company.

Emails from the project’s leadership to employees show that the goal of Cosmos (different from the company’s existing Cosmos deep learning product) was to build a state-of-the-art video foundation model “that encapsulates simulation of light transport, physics, and intelligence in one place to unlock various downstream applications critical to NVIDIA.”

Slack messages from inside a channel the company set up for the project show employees using an open-source YouTube video downloader called yt-dlp, combined with virtual machines that refresh IP addresses to avoid being blocked by YouTube. According to the messages, they were attempting to download full-length videos from a variety of sources including Netflix, but were focused on YouTube videos. Emails viewed by 404 Media show project managers discussing using 20 to 30 virtual machines in Amazon Web Services to download 80 years' worth of videos per day.
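
For a sense of scale, the figures quoted above (80 years of video per day, spread over 20 to 30 virtual machines) can be turned into a rough per-machine rate. This is only a back-of-the-envelope sketch: it assumes a 365-day year and an even split of work across machines, neither of which is stated in the reporting, and the function name is made up for illustration.

```python
# Rough arithmetic on the reported figures; not a description of Nvidia's setup.
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year


def hours_per_vm_per_day(video_years_per_day: float, vm_count: int) -> float:
    """Hours of video each VM would need to fetch per day, assuming an even split."""
    return video_years_per_day * HOURS_PER_YEAR / vm_count


for vms in (20, 30):
    hours = hours_per_vm_per_day(80, vms)
    # Dividing by 24 gives how many times faster than real-time playback that is.
    print(f"{vms} VMs: {hours:,.0f} hours of video per VM per day "
          f"(about {hours / 24:,.0f}x real time)")
```

Under those assumptions, each machine would be pulling down somewhere between roughly 23,000 and 35,000 hours of footage every day.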

Chat

FIash Mob #5678@beehaw.org
3 months ago
And, no doubt, they’ve already run the numbers and figured out that it costs them less to steal the data and pay off fines and lawsuits than it does to actually compensate creators properly.

I really hate the capitalist hellhole I’m stuck in.
