#ai-benchmarking tag

Google Deepmind hackathon to pit meatbags v machines

Google DeepMind developed an empirical framework with a cognitive taxonomy to measure progress toward artificial general intelligence by benchmarking AI systems against human capabilities across ten cognitive areas.

Artificial intelligence

from www.scientificamerican.com

2 days ago

As AI keeps improving, mathematicians struggle to foretell their own future

First Proof, a benchmarking initiative, is launching its second round to evaluate large language models' ability to contribute to research-level mathematics, now requiring transparency and access from participating AI companies.

from LawSites

2 weeks ago

AI Legal Research Startup Descrybe Launches 'Legal Reasoning' Tool; Says It Outperforms ChatGPT, Claude, and Gemini on Bar Exam Benchmark

DescrybeLM answered all 200 questions correctly. The general-purpose models each missed between 13 and 23 questions, achieving accuracy rates ranging from 88.5% to 93.5%. Rubric-scored reasoning quality - a separate measure evaluating whether systems correctly identified governing legal rules and applied them to the facts - followed a similar pattern. DescrybeLM scored 99.70% on that dimension.
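
The accuracy range quoted above follows directly from the miss counts on a 200-question set. A minimal sketch of that arithmetic in Python; the function name `accuracy_pct` is illustrative and does not come from the article or the benchmark.

```python
# Hypothetical helper: converts a count of missed questions into an accuracy percentage.
TOTAL_QUESTIONS = 200  # size of the question set reported in the summary


def accuracy_pct(missed: int, total: int = TOTAL_QUESTIONS) -> float:
    """Return the percentage of questions answered correctly."""
    return 100 * (total - missed) / total


# Miss counts reported in the summary: 0 for DescrybeLM, 13 to 23 for the general-purpose models.
for missed in (0, 13, 23):
    print(f"{missed} missed -> {accuracy_pct(missed):.1f}%")
```

Missing 13 of 200 gives 93.5%, and missing 23 gives 88.5%, which matches the reported range.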

Artificial intelligence

from Futurism

1 month ago

AIs Controlling Vending Machines Start Cartel After Being Told to Maximize Profits At All Costs

Claude Opus 4.6 substantially outperforms competitors at managing a simulated vending business, achieving higher profits and using aggressive market strategies including price coordination.

Artificial intelligence

from ZDNET

2 months ago

Worried AI will take your remote job? You're safe for now, this study shows

AI models performed significantly worse than human remote freelancers on diverse, complex work tasks, though AI capabilities continue to improve.

#artificial-intelligence

from www.scientificamerican.com

3 months ago

Artificial intelligence

Each Time AI Gets Smarter, We Change the Definition of Intelligence

Artificial intelligence

from Futurism

9 months ago

Apple Researchers Just Released a Damning Paper That Pours Water on the Entire AI Industry

Apple researchers question the reasoning capabilities of leading AI models, calling current industry claims an 'illusion of thinking'.

Artificial intelligence

from TechCrunch

5 months ago

OpenAI says GPT-5 stacks up to humans in a wide range of jobs

GDPval benchmark evaluates AI models across 44 occupations in nine GDP-contributing industries; GPT-5-high matched or exceeded experts 40.6% of the time.

Artificial intelligence

from InfoQ

6 months ago

Kaggle Introduces Game Arena to Benchmark AI Models in Strategic Games

Kaggle and Google DeepMind launched Kaggle Game Arena to benchmark AI decision-making by running all-play-all strategy game competitions with open-source environments.

Artificial intelligence

from InfoWorld

8 months ago

AI benchmarking tools evaluate real world performance

xbench is an open-source benchmarking tool that tests AI models on real-world tasks rather than just standard benchmarks.

from TechCrunch

9 months ago

LM Arena, the organization behind popular AI leaderboards, lands $100M

LM Arena has become an essential crowdsourced benchmarking project for AI labs, raising $100 million in seed funding to further its mission of evaluating AI models.

Artificial intelligence

from TechRepublic

10 months ago

OpenAI's o3: AI Benchmark Discrepancy Reveals Gaps in Performance Claims

The performance of OpenAI's o3 model on benchmarks significantly differed from earlier claims, revealing the complexity and variability in AI evaluations.

from TechCrunch

11 months ago

AI benchmarking platform Chatbot Arena forms a new company

Chatbot Arena is forming a company called Arena Intelligence Inc. to enhance its benchmarking capabilities significantly while maintaining neutrality in AI testing.

Artificial intelligence

from techcrunch.com

11 months ago

Debates over AI benchmarking have reached Pokemon

Last week, a post on X claimed Google's Gemini model surpassed Anthropic's Claude model in Pokemon, stirring controversy over AI benchmarks and implementation.
