#llm-training-data tag

FSF urges AI vendors to liberate LLMs

The FSF received a settlement notice from Anthropic's copyright infringement lawsuit, with Anthropic agreeing to create a $1.5 billion compensation fund for authors whose works were used in AI model training without permission.

from Practical Ecommerce

2 months ago

Better Metrics for AI Search Visibility

Traffic. Focusing on traffic obscures the purpose of AI answers: to satisfy a need on-site, not to generate clicks. AI-generated solutions do not typically include links to branded websites. Google's AI Overviews, for example, sometimes links product names to organic search listings. Thus visibility does not equate to traffic. A merchant's products could appear in an AI answer and receive no clicks.

Artificial intelligence

from The Atlantic

4 months ago

I Am Time Magazine's Person of the Year

For the past two years, my colleague Alex Reisner has investigated precisely how tech companies use massive data sets to train their LLMs. He has repeatedly found that so-called architects of AI have relied heavily on enormous databases of copyrighted work to create chatbots and other programs, and has also found that this work is generally taken without the consent or awareness of its creators: musicians, filmmakers, YouTubers, podcasters, illustrators, writers.

Artificial intelligence

Privacy professionals

from The Register

5 months ago

Bots may use your private chats to train themselves

LLM builders can use users' conversations for training and commercial purposes with minimal transparency or privacy safeguards, increasing risk of sensitive data exposure.

Intellectual property law

from The Walrus

6 months ago

My Book Was Stolen by an AI Company. Why Does Suing Them Feel Wrong? | The Walrus

Major tech companies trained large language models on pirated and unlicensed books, provoking creators' outrage and raising legal, ethical, and infrastructural concerns.

Design

from Fast Company

6 months ago

AI is about to make it faster (and a whole lot cheaper) to redesign your home

Havenly's app-based AI turns user room photos into modifiable, shoppable design alternatives using image generation, chatbot interaction, and millions of design decision data points.

Artificial intelligence

from InfoQ

7 months ago

Hugging Face Releases FinePDFs: A 3-Trillion-Token Dataset Built from PDFs

FinePDFs is a 3.65 TB, 475 million–document PDF corpus across 1,733 languages offering trillions of tokens and complementary, domain-rich data for LLM training.

Artificial intelligence

from Intelligencer

7 months ago

The AI-Scraping Free-for-All Is Coming to an End

AI companies and startups aggressively scrape web content for LLM training, prompting licensing deals, lawsuits, and an arms race of deceptive crawlers overwhelming websites.

Marketing tech

from Practical Ecommerce

7 months ago

Control AI Answers about Your Brand

AI optimization requires managing LLM training-data presence and live-search citations to influence AI-generated mentions, recommendations, and buying decisions.

Marketing tech

from AdExchanger

8 months ago

Publisher Payment Plans; The Details Of Retail Data Deals | AdExchanger

Meta signals willingness to reimburse publishers for content used in LLM training, though concrete actions and public transparency remain absent.
