#model-evaluation

#artificial-intelligence
Artificial intelligence
from WIRED
2 months ago

This Tool Probes Frontier AI Models for Lapses in Intelligence

Scale AI's new tool, Scale Evaluation, automates the testing of AI models to identify weaknesses and improve their performance.
Data science
from Hackernoon
3 days ago

The Future of Remote Sensing: Few-Shot Learning and Explainable AI | HackerNoon

Few-shot learning techniques for remote sensing enhance model efficiency with limited data, emphasizing the need for explainable AI.
#machine-learning
Artificial intelligence
from hackernoon.com
5 days ago

Limited Gains: Multi-Token Training on Natural Language Choice Tasks

Multi-token prediction enhances model performance in natural language processing benchmarks.
Larger models lead to improved scalability and faster inference times.
from Hackernoon
2 months ago
Artificial intelligence

When Smaller is Smarter: How Precision-Tuned AI Cracks Protein Mysteries | HackerNoon

from Hackernoon
6 months ago
Bootstrapping

How Many Glitch Tokens Hide in Popular LLMs? Revelations from Large-Scale Testing | HackerNoon

Artificial intelligence
from Hackernoon
1 year ago

Behind the Scenes: The Prompts and Tricks That Made Many-Shot ICL Work | HackerNoon

GPT4(V)-Turbo demonstrates variable performance in many-shot ICL, with notable failures to scale effectively under certain conditions.
from Hackernoon
4 weeks ago

Comparing Chameleon AI to Leading Image-to-Text Models | HackerNoon

In evaluating Chameleon, we focus on tasks requiring text generation conditioned on images, particularly image captioning and visual question-answering, with results grouped by task specificity.
Artificial intelligence
from TechCrunch
1 month ago

OpenAI partner says it had relatively little time to test the company's newest AI models | TechCrunch

Metr says it had limited time to test OpenAI's new o3 and o4-mini models, reducing the comprehensiveness of its evaluation.
Software development
from InfoQ
3 months ago

OpenAI Introduces Software Engineering Benchmark

SWE-Lancer benchmark assesses AI language models on real-world freelance software engineering tasks.
AI models face significant challenges in software engineering despite advancements.
from Hackernoon
1 year ago

Limitations in AI Model Evaluation: Bias, Efficiency, and Human Judgment | HackerNoon

Our work identifies 12 important aspects in real-world deployments of text-to-image generation models, including alignment, quality, aesthetics, reasoning, bias, and efficiency.