"New research suggests an AI agent can't fully replace a human consultant - at least for now. Mercor, the AI training giant, tested how well leading AI models, acting as agents, performed real-world consulting, banking, and legal tasks. The models failed most of the time, but Mercor's CEO, Brendan Foody, told Business Insider that the results tell only part of the story."
"Across all task categories, the AI agents successfully completed the tasks less than 25% of the time on the first try. Given eight attempts, the agents could only complete 40% of the tasks. For the management consulting tasks, OpenAI's GPT 5.2 initially performed the best, completing nearly 23% of the tasks on its first attempt. Anthropic's Opus 4.6, released this week, performed even better at nearly 33%."
"While many of the tasks were not completed, Foody said the success rate for GPT 3 was only 3%, compared to 23% for GPT 5.2. Anthropic's model went from 13% to 33% on consulting tasks in a matter of months. Foody said he expects the success rate of the models to be closer to 50% by the end of the year."
Mercor evaluated leading AI models acting as agents on simulated real-world consulting, banking, and legal tasks using the APEX-Agents benchmark. The consulting tasks were based on expert surveys and input from major firms such as McKinsey, BCG, Deloitte, Accenture, and EY. Across all categories, the agents completed tasks less than 25% of the time on the first try and completed only about 40% of the tasks when given eight attempts. On the consulting tasks, GPT 5.2 completed nearly 23% on its first attempt, while Anthropic’s Opus 4.6 reached nearly 33%. Mercor CEO Brendan Foody pointed to how quickly the models have improved: GPT 3 succeeded only about 3% of the time, and Anthropic’s model rose from 13% to 33% on consulting tasks in a matter of months. Foody expects success rates to be closer to 50% by the end of the year.
Read at Business Insider