
"On the question of whether the ChatGPT summaries "could feasibly blend into the rest of your summary lineups," the average summary rated a score of just 2.26 on a scale of 1 ("no, not at all") to 5 ("absolutely"). On the question of whether the summaries were "compelling," the LLM summaries averaged just 2.14 on the same scale. Across both questions, only a single summary earned a "5" from the human evaluator on either question, compared to 30 ratings of "1.""
Not up to standards
"Writers were also asked to write out more qualitative assessments of the individual summaries they evaluated. In these, the writers complained that ChatGPT often conflated correlation and causation, failed to provide context (e.g., that soft actuators tend to be very slow), and tended to overhype results by overusing words like "groundbreaking" and "novel" (though this last behavior went away when the prompts specifically addressed it)."
"Overall, the researchers found that ChatGPT was usually good at "transcribing" what was written in a scientific paper, especially if that paper didn't have much nuance to it. But the LLM was weak at "translating" those findings by diving into methodologies, limitations, or big picture implications. Those weaknesses were especially true for papers that offered multiple differing results, or when the LLM was asked to summarize two related papers into one brief."
In a survey, writers rated the ChatGPT summaries poorly: on a scale of 1 ("no, not at all") to 5 ("absolutely"), the average score was 2.26 for whether a summary could blend into existing summary lineups and 2.14 for whether it was compelling. Across both questions, only a single summary earned a 5 on either question, compared with 30 ratings of 1. In their qualitative assessments, the writers reported that ChatGPT often conflated correlation with causation, failed to provide necessary context (for example, that soft actuators tend to be very slow), and overhyped results with words like "groundbreaking" and "novel" unless the prompts specifically addressed that behavior. ChatGPT was usually good at transcribing what a paper said, especially when the paper had little nuance, but it was weak at translating those findings by covering methodologies, limitations, or big-picture implications. These weaknesses were most pronounced for papers that offered multiple differing results and when the model was asked to summarize two related papers in one brief.
Read at Ars Technica