#ai-evaluation
Panels of peers are needed to gauge AI's trustworthiness - experts are not enough

Expert-only evaluation methods like the 'Sunstein test' risk concentrating AI trustworthiness judgments among elites, thereby reinforcing existing power structures shaping AI objectives.

Startup companies

from Fortune

Meet the world's youngest self-made billionaire, who skipped finals to make an empire out of teaching AI 'what only humans know'

A startup matched overseas engineers with companies, expanded into human-in-the-loop AI evaluation services, and scaled rapidly to a $10 billion valuation.

from TechCrunch

Laude Institute announces first batch of 'Slingshots' AI grants

On Thursday, the Laude Institute announced its first batch of Slingshots grants, aimed at "advancing the science and practice of artificial intelligence." Designed as an accelerator for researchers, the Slingshots program is meant to provide resources that would be unavailable in most academic settings, whether it's funding, compute power, or product and engineering support. In exchange, the recipients pledge to produce some final work product, whether it's a startup, an open-source codebase, or another type of artifact.

Artificial intelligence

from InfoWorld

Databricks adds customizable evaluation tools to boost AI agent accuracy

Agent Bricks' Agent-as-a-Judge, Tunable Judges, and Judge Builder enable enterprises to customize evaluations and align agent behavior with business-specific standards.

from Nature

6 months ago

We need a new Turing test to assess AI's real-world knowledge

Some lawyers have learnt that the hard way, and have been fined for filing AI-generated court briefs that misrepresented principles of law and cited non-existent cases. The same is true in other fields. For example, AI models can pass the gold-standard test in finance - the Chartered Financial Analyst exam - yet score poorly on simple tasks required of entry-level financial analysts (see go.nature.com/42tbrgb).

Artificial intelligence

from Geeky Gadgets

6 months ago

How to Fix Your AI Prompt Writing: 6 Principles That Actually Work

What if the secret to unlocking AI's full potential wasn't in the technology itself but in how we use it? After spending over 200 hours teaching AI to write, Nate B Jones discovered that the biggest mistakes aren't about algorithms or software limitations, they're about human misunderstanding. Too often, we assume AI can read between the lines of vague instructions or magically produce brilliance without guidance. The result? Generic, uninspired content that misses the mark.

Artificial intelligence

from Nature

6 months ago

AI language models killed the Turing test: do we even need a replacement?

Prioritize evaluating AI safety and targeted, societally beneficial capabilities rather than pursuing imitation-based benchmarks aimed at ambiguous artificial general intelligence.

from Futurism

OpenAI Releases List of Work Tasks It Says ChatGPT Can Already Replace

ChatGPT maker OpenAI has released a new evaluation, dubbed GDPval, to measure how well its AIs perform on "economically valuable, real-world tasks across 44 occupations." "People often speculate about AI's broader impact on society, but the clearest way to understand its potential is by looking at what models are already capable of doing," the company wrote in an accompanying blog post. "Evaluations like GDPval help ground conversations about future AI improvements in evidence rather than guesswork, and can help us track model improvement over time," OpenAI added.

Artificial intelligence

from InfoQ

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

Google Stax provides an objective, data-driven, repeatable framework for AI model evaluation with customizable datasets, default and custom evaluators, and LLM-based judges.

from ZDNET

OpenAI tested GPT-5, Claude, and Gemini on real-world tasks - the results were surprising

OpenAI's GDPval evaluates AI performance on 1,320 real-world tasks across 44 occupations to measure economic impact and narrow theory-practice gaps.

#generative-ai

from Fortune

Venture

Exclusive: Touring Capital, founded by ex-M12 and SoftBank investors, closes $330 million first fund

from New Scientist

Artificial intelligence

Around one-third of AI search tool answers make unsupported claims

from Business Insider

Read the deck an ex-Waymo engineer used to raise $3.75 million from Sheryl Sandberg and Kindred Ventures

Scorecard raised $3.75M to build an AI evaluation platform that tests AI agents for performance, safety, and faster deployment for startups and enterprises.

from Fast Company

8 months ago

The gap between AI hype and newsroom reality

AI tools often underperform as informed researchers and summarizers for newsroom-specific tasks like local government meeting transcripts.

from odsc.medium.com

8 months ago

The ODSC AI West 2025 Preliminary Schedule, Mastering AI Evaluation, Building Real World Agentic Applications, and GPT-5 News.

ODSC AI West 2025 offers training on MLOps, RAG, and AI Agents, featuring over 250 industry experts to enhance practical skills.

from Axios

8 months ago

OpenAI's GPT-5 targets coders

GPT-5 combines a large language model with reasoning capabilities, improving task efficiency and reducing hallucinations, with notable applications in health.

10 months ago

Two Indispensable Tools for Measuring the Quality of AI Systems

Evaluating free text outputs of LLMs requires significant human judgment and high-quality ground truth data.

from HackerNoon

2 years ago

AI Still Can't Explain a Joke, or a Metaphor, Like a Human Can

Human evaluation is crucial for assessing AI's understanding of multimodal figurative language.

from Business Insider

10 months ago

Anthropic's Claude plays 'for peace over victory' in a game of Diplomacy against other AI

Diplomacy is a strategic board game set on a map of Europe in 1901 - a time when tensions between the continent's most powerful countries were simmering in the lead-up to World War I.

Artificial intelligence

from www.mercurynews.com

Chatbot Arena group goes from academic project at UC Berkeley to $600 million startup

LMArena has raised $100 million in seed funding, demonstrating significant investor confidence in AI evaluation tools.

Evaluation Mindset: Taming the Gen AI Dragon

Evaluation in AI is a mindset, not a resource issue; it requires ongoing inquiry and critical thinking for successful application deployment.

Beyond Benchmarks: Really Evaluating AI

Benchmarks help standardize test sets for AI models, ensuring fair evaluation of performance.

from HackerNoon

Chameleon AI Shows Competitive Edge Over LLaMa-2 and Other Models

Chameleon exhibits competitive performance against leading text-only language models, excelling particularly in commonsense reasoning.

The evaluations indicate that Chameleon is capable of outperforming larger models like Llama-2 in specific benchmarks.

The problems with running human evals

Result ambiguity can come in different forms. The most common is a lack of agreement among raters, which is measured by inter-rater reliability (IRR).
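One common way to quantify agreement between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: the article summary does not say which IRR statistic it has in mind, and the function name `cohens_kappa` and the example labels are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Hypothetical helper for illustration; assumes categorical labels
    and exactly two raters.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0

    return (observed - expected) / (1 - expected)


# Made-up ratings of four model outputs by two human raters.
ratings_a = ["good", "good", "bad", "bad"]
ratings_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(ratings_a, ratings_b)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 indicates strong agreement, while a value near 0 means the raters agree about as often as chance would predict. With more than two raters, a statistic such as Fleiss' kappa or Krippendorff's alpha is typically used instead.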

Artificial intelligence

from HackerNoon

1 year ago

MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data: Single-Subject Evaluations

Evaluation metrics improve significantly with increased fine-tuning data.