#model-safety

from Business Insider
1 month ago

Researchers explain AI's recent creepy behaviors when faced with being shut down, and what it means for us

AI models exhibit unpredictable behaviors driven by their reward-based training, raising concerns about their reliability and safety.
from Hackernoon
7 months ago

Comprehensive Detection of Untrained Tokens in Language Model Tokenizers | HackerNoon

The disconnect between tokenizer creation and model training allows certain inputs, termed 'glitch tokens,' to induce unwanted behavior in language models.
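The core detection idea reported in the glitch-token literature is that tokens which never (or rarely) appear in training data keep embeddings close to their initialization, so their embedding-row norms cluster apart from trained tokens. A minimal sketch of that intuition, using a synthetic embedding matrix and an illustrative z-score threshold (both are assumptions for demonstration, not the article's exact method):

```python
import numpy as np

def find_glitch_candidates(embeddings: np.ndarray, z_thresh: float = -2.0):
    """Flag token ids whose embedding-row norm is an outlier on the low side.

    Intuition: tokens absent from training data keep near-initialization
    embeddings, so their norms sit far below the trained tokens' norms.
    The threshold is a hypothetical choice for illustration.
    """
    norms = np.linalg.norm(embeddings, axis=1)
    z = (norms - norms.mean()) / norms.std()
    return [int(i) for i in np.where(z < z_thresh)[0]]

# Synthetic demo: 100 "trained" rows with norm ~1, two near-zero
# rows simulating embeddings that were never updated during training.
rng = np.random.default_rng(0)
emb = rng.normal(0, 1 / np.sqrt(64), size=(102, 64))
emb[100] *= 0.01
emb[101] *= 0.01
print(find_glitch_candidates(emb))
```

On a real model one would run this over the (un)embedding matrix and then verify candidates behaviorally, e.g. by prompting the model to repeat each flagged token.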
from Hackernoon
9 months ago

Increased LLM Vulnerabilities from Fine-tuning and Quantization: Experiment Set-up & Results | HackerNoon

Testing across different downstream tasks, covering both fine-tuning and quantization, shows that while fine-tuning can improve task effectiveness, it can simultaneously increase an LLM's vulnerability to jailbreaking.