“You don’t need to train on novels and pop songs to get the benefits of AI in science” @ednewtonrex


You Don’t Need to Steal Art to Cure Cancer: Why Ed Newton-Rex Is Right About AI and Copyright

Ed Newton-Rex said the quiet truth out loud: you don’t need to scrape the world’s creative works to build AI that saves lives, or even to beat the Chinese Communist Party.

It’s a myth that AI “has to” ingest novels and pop lyrics to learn language. Models acquire syntax, semantics, and pragmatics from any large, diverse corpus of natural language. That includes transcribed speech, forums, technical manuals, government documents, Wikipedia, scientific papers, and licensed conversational data. Speech systems learn from audio–text pairs, not necessarily fiction; text models learn distributional patterns wherever language appears. Of course, literary works can enrich style, but they’re not necessary for competence: instruction tuning, dialogue data, and domain corpora yield fluent models without raiding copyrighted art. In short, creative literature is optional seasoning, not the core ingredient for teaching machines to “speak.”

Google’s new cancer-therapy paper proves the point. Their model wasn’t trained on novels, lyrics, or paintings. It was trained responsibly on scientific data. And yet it achieved real, measurable progress in biomedical research. That simple fact dismantles one of Silicon Valley’s most persistent myths: that copyright is somehow an obstacle to innovation.

You don’t need to train on Joni Mitchell to discover a new gene pathway. You don’t need to ingest John Coltrane to find a drug target. AI used for science can thrive within the guardrails of copyright because science itself already has its own open-data ecosystems—peer-reviewed, licensed, and transparent.

Companies like Anthropic and Meta that insist “fair use” covers mass ingestion of stolen creative works aren’t curing diseases; they’re training entertainment engines. They’re ripping off artists’ livelihoods to make commercial chatbots, story generators, and synthetic-voice platforms designed to compete against the very creators whose works they exploited. That’s not innovation; it’s market capture through appropriation.

They do it for reasons as old as time: they do it for the money.

The ethical divide is clear:

  • AI for discovery builds on licensed scientific data.
  • AI for mimicry plunders culture to sell imitation.

We should celebrate the first and regulate the second. Upholding copyright and requiring provenance disclosures doesn’t hinder progress; it restores integrity. The same society that applauds AI in medical breakthroughs can also insist that creative industries remain human-centered and law-abiding. Civil-military fusion doesn’t imply that there are only two ingredients in the gumbo of life.

If Google can advance cancer research without stealing art, so can everyone else, and Google can keep different rules for the entertainment side of its business or investment portfolio. The choice isn’t between curing cancer and protecting artists; it’s between honesty and opportunism. The repeated whinging of AI labs about “because China” would be a lot more believable if they used their political influence to get the CCP to release Hong Kong activist Jimmy Lai from stir. We can join Jimmy and his amazingly brave son Sebastian and say “because China”, too. #FreeJimmyLai

Must Read Post by @ednewtonrex on Why He Resigned from Stability AI Over Fake Fair Use Defense

I’ve resigned from my role leading the Audio team at Stability AI, because I don’t agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use’. 

First off, I want to say that there are lots of people at Stability who are deeply thoughtful about these issues. I’m proud that we were able to launch a state-of-the-art AI music generation product trained on licensed training data, sharing the revenue from the model with rights-holders. I’m grateful to my many colleagues who worked on this with me and who supported our team, and particularly to Emad for giving us the opportunity to build and ship it. I’m thankful for my time at Stability, and in many ways I think they take a more nuanced view on this topic than some of their competitors. 

But, despite this, I wasn’t able to change the prevailing opinion on fair use at the company. 

This was made clear when the US Copyright Office recently invited public comments on generative AI and copyright, and Stability was one of many AI companies to respond. Stability’s 23-page submission included this on its opening page: 

“We believe that AI development is an acceptable, transformative, and socially-beneficial use of existing content that is protected by fair use”.

For those unfamiliar with ‘fair use’, this claims that training an AI model on copyrighted works doesn’t infringe the copyright in those works, so it can be done without permission, and without payment. This is a position that is fairly standard across many of the large generative AI companies, and other big tech companies building these models — it’s far from a view that is unique to Stability. But it’s a position I disagree with. 

I disagree because one of the factors affecting whether the act of copying is fair use, according to Congress, is “the effect of the use upon the potential market for or value of the copyrighted work”. Today’s generative AI models can clearly be used to create works that compete with the copyrighted works they are trained on. So I don’t see how using copyrighted works to train generative AI models of this nature can be considered fair use. 

But setting aside the fair use argument for a moment — since ‘fair use’ wasn’t designed with generative AI in mind — training generative AI models in this way is, to me, wrong. Companies worth billions of dollars are, without permission, training generative AI models on creators’ works, which are then being used to create new content that in many cases can compete with the original works. I don’t see how this can be acceptable in a society that has set up the economics of the creative arts such that creators rely on copyright. 

To be clear, I’m a supporter of generative AI. It will have many benefits — that’s why I’ve worked on it for 13 years. But I can only support generative AI that doesn’t exploit creators by training models — which may replace them — on their work without permission. 

I’m sure I’m not the only person inside these generative AI companies who doesn’t think the claim of ‘fair use’ is fair to creators. I hope others will speak up, either internally or in public, so that companies realise that exp