Microsoft researchers find AI models and agents can't handle long-running tasks
Briefly

“Our findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing on average 25 percent of document content over 20 delegated interactions, and an average degradation across all models of 50 percent,” the authors report.
“To test how LLMs handle long-running knowledge work tasks, the researchers devised a benchmark called DELEGATE-52. It simulates multistep workflows across 52 professional domains, such as writing code, crystallography, and music notation. It is a more taxing test than sorting a spreadsheet, a task that should be table stakes for any aspiring workflow agent.”
“In the accounting domain, for example, the challenge involves a seed document that represents the accounting ledger of Hack Club, a nonprofit organization. The model is asked to split the seed document into separate category-based files and then to merge these chronologically back into a single file.”
“Claude Cowork handles tasks autonomously. Give it a goal and Claude works on your computer, local files, and applications to return a finished deliverable.”
Large language models used as autonomous workflow agents can corrupt work documents during long, multistep tasks. Microsoft researchers studied how LLMs perform when asked to complete knowledge work over extended interactions. They created the DELEGATE-52 benchmark, which simulates multistep workflows across 52 professional domains, including coding, crystallography, and music notation. In the accounting scenario, a model must split a nonprofit's ledger into category-based files and then merge them chronologically back into a single file. Frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) lose on average 25 percent of document content over 20 delegated interactions, and average degradation across all models reaches 50 percent. The findings indicate that workflow agents should be constrained to reduce error accumulation.
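The split-then-merge task in the accounting scenario can be sketched as a deterministic round trip, which gives a rough sense of what "losing document content" means. The Python below is only an illustration under stated assumptions: the ledger rows, the field layout, and the function names `split_by_category`, `merge_chronologically`, and `retained_fraction` are all hypothetical, and the retention measure is a simple stand-in, not the scoring the DELEGATE-52 authors use.

```python
from collections import defaultdict

# Hypothetical ledger rows: (ISO date, category, amount, memo).
# The real seed document's format is not described in the source.
LEDGER = [
    ("2024-01-05", "grants", 1200.00, "Spring grant"),
    ("2024-01-09", "hosting", -45.50, "Server bill"),
    ("2024-02-01", "grants", 800.00, "Donor gift"),
    ("2024-02-14", "events", -300.00, "Venue deposit"),
]


def split_by_category(rows):
    """Group rows into one list per category (a stand-in for separate files)."""
    files = defaultdict(list)
    for row in rows:
        files[row[1]].append(row)
    return dict(files)


def merge_chronologically(files):
    """Merge all category lists back into one list ordered by date."""
    merged = [row for rows in files.values() for row in rows]
    # ISO dates sort correctly as plain strings.
    return sorted(merged, key=lambda row: row[0])


def retained_fraction(original, result):
    """Fraction of original rows still present after the round trip."""
    if not original:
        return 1.0
    remaining = list(result)
    kept = 0
    for row in original:
        if row in remaining:
            remaining.remove(row)  # count each surviving row once
            kept += 1
    return kept / len(original)


files = split_by_category(LEDGER)
round_trip = merge_chronologically(files)

# Simulate an agent that silently drops one row during the merge.
lossy = round_trip[:-1]

print(retained_fraction(LEDGER, round_trip))
print(retained_fraction(LEDGER, lossy))
```

A lossless round trip returns every original row, so any shortfall in the retained fraction is content the editing step destroyed. Repeating such a check after each delegated step is one simple way to notice errors before they accumulate.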
Read at The Register