AI is ready to take over Python programming, but not much else
Briefly

"Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."
"The findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing an average 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%."
"The benchmark contains 310 work environments across 52 professional domains including coding, crystallography, genealogy and music sheet notation. Each environment consists of real documents totaling around 15K tokens in length, and five to 10 complex editing tasks that a user might ask an LLM to perform."
"Tests of how well 19 large language models (LLMs) complete complicated multi-step tasks have shown that they are both error-prone and, in many cases, unreliable."
Nineteen large language models were evaluated on multi-step delegated editing tasks across 52 professional domains. The benchmark, DELEGATE-52, used 310 work environments containing real documents totaling about 15K tokens and five to 10 complex editing tasks per environment. Results showed models were error-prone and often unreliable, producing sparse but severe mistakes that silently corrupt documents. Errors compounded over long interactions. Frontier models lost an average 25% of document content over 20 delegated interactions, and all models showed an average degradation of 50%. The findings indicate that delegating document editing to current LLMs can materially damage work products.
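The compounding claim can be illustrated with a back-of-the-envelope calculation. Assuming a roughly constant per-edit loss rate (an assumption for illustration; the article does not state how loss is distributed across interactions), a 25% cumulative content loss over 20 edits implies a small per-edit loss that silently accumulates:

```python
# Sketch under an assumed constant per-edit loss rate (not stated in the article):
# if r is the fraction of content retained per edit, then r**20 = 0.75
# corresponds to the reported 25% loss over 20 delegated interactions.
cumulative_retention = 0.75   # 25% of content lost (frontier-model average)
interactions = 20

per_edit_retention = cumulative_retention ** (1 / interactions)
per_edit_loss_pct = (1 - per_edit_retention) * 100

print(f"Implied per-edit retention: {per_edit_retention:.4f}")
print(f"Implied per-edit loss: {per_edit_loss_pct:.2f}%")
```

Under this assumption, each edit loses only about 1.4% of the document, which is why the damage is easy to miss in any single interaction yet severe over a long delegation.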
Read at Computerworld