@CadeMetz @ceciliakang @sheeraf @stuartathompson @nicogrant: How Tech Giants Cut Corners to Harvest Data for A.I.


[This is a must-read, deeply researched, long-form article about how Big Tech (mostly OpenAI, Google and Meta) is betraying consumers' trust and abrogating its promises to creators in a mad, greedy, frothing rush toward some unknown payoff from AI. The AI gold rush dwarfs the dot-bomb boom, and this article is a road map to just how bad it really is and how debased these people really are. Thanks to the destruction of the newsroom, only a handful of news outlets can deliver work of this quality, but thankfully the New York Times is still standing. How long it will remain standing is another story.]

OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems….

OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make an A.I. system smarter.

Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are “independent” of the video platform.

Ultimately, an OpenAI team transcribed more than one million hours of YouTube videos, the people said….

Like OpenAI, Google transcribed YouTube videos to harvest text for its A.I. models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.

Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its A.I. products.

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips — has increasingly become the lifeblood of the booming A.I. industry. 

Read the full article at The New York Times.