Synthetic data is the new AI gold rush, but critics call it 'data laundering'

AI development faces a potential shortage of usable training data as regulations block web scraping. In response, the industry is turning to synthetic data, which many see as vital for training future models. According to experts, companies like OpenAI may rely on synthetic data not only because high-quality data is scarce but also to distance their models from copyrighted material. Critics have termed this practice "data laundering," arguing that it lets firms claim ethical training practices while sidestepping copyright infringement.
"Recently in the industry, synthetic data has been talked about a lot," said Sebastien Bubeck, a member of technical staff at OpenAI, in the company's livestreamed release of GPT-5 last week. Bubeck stressed its importance for the future of AI models—an idea echoed by his boss, Sam Altman, who live-tweeted the event, saying he was "excited for much more to come."
One such critic, Southern, believes there's another motive: "It further distances them from any copyrighted materials they've trained on that could land them in hot water." For this reason, he has publicly called the practice "data laundering."
He argues that AI companies could train their models on copyrighted works, generate AI variations, then remove the originals from their datasets. They could then "claim their training set is 'ethical' because it didn't technically train on the original image by their logic," says Southern.
"That's why we call it data laundering, because in a sense, they're attempting to clean the data and strip it of its copyright."
Read at Fast Company