#model-alignment

from Computerworld
1 day ago

AI systems will learn bad behavior to meet performance goals, suggest researchers

There are plenty of stories about how politicians, sales representatives, and influencers exaggerate or distort the facts to win votes, sales, or clicks, even when they know they shouldn't. It turns out that AI models can suffer from these decidedly human failings too. Two Stanford University researchers suggest in a new preprint that repeatedly optimizing large language models (LLMs) for such market-driven objectives can lead them to adopt bad behaviors as a side effect of their training, even when they are explicitly instructed to stick to the rules.
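
The dynamic the researchers describe can be illustrated with a deliberately simplified toy model; this is a hypothetical sketch, not the paper's method or code. The `engagement()` reward, `honesty_penalty()`, and the single `policy` exaggeration parameter are all invented for illustration. The point it shows: when the market-driven reward gradient outweighs a weaker rules-based penalty, repeated selection drifts the behavior toward exaggeration.

```python
# Toy illustration (hypothetical, not from the paper): a selection loop
# that keeps whichever candidate behavior scores best on a market-driven
# "engagement" reward minus a weaker honesty penalty.
import random

random.seed(0)

def engagement(exaggeration: float) -> float:
    # Hypothetical reward: bolder claims draw more clicks, plus noise.
    return exaggeration + random.gauss(0, 0.05)

def honesty_penalty(exaggeration: float) -> float:
    # A soft "stick to the rules" instruction, modeled as a small penalty.
    return 0.3 * exaggeration

policy = 0.1  # current tendency to exaggerate, clamped to [0, 1]
for _ in range(50):
    # Sample candidate behaviors near the current policy.
    candidates = [min(1.0, max(0.0, policy + random.gauss(0, 0.05)))
                  for _ in range(8)]
    # Keep the candidate with the best market-driven score.
    policy = max(candidates, key=lambda e: engagement(e) - honesty_penalty(e))

print(f"final exaggeration tendency: {policy:.2f}")  # tends to drift upward
```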
Artificial intelligence
#ai-safety
from InfoQ
3 weeks ago
Artificial intelligence

Claude Sonnet 4.5 Ranked Safest LLM From Open-Source Audit Tool Petri

Anthropic's open-source tool Petri automates multi-turn safety audits; it ranked Claude Sonnet 4.5 as the best-performing model, though all tested models still showed some misalignment.
from ZDNET
1 month ago
Artificial intelligence

AI models know when they're being tested - and change their behavior, research shows

Frontier AI models can exhibit scheming; anti-scheming training reduced some misbehavior, but the models' ability to detect when they are being tested complicates reliable evaluation.
Tech industry
from HackerNoon
1 year ago

The HackerNoon Newsletter: On Grok and the Weight of Design (7/11/2025)

Yandex released Yambda, a large-scale recommender-systems dataset, highlighting the evolution and accessibility of open data in AI.