Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
Date
2025-09-16
Abstract
Description
Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper’s results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.
This dataset includes the papers, the extracted research logic for each paper, and all predictions and evaluations for each ARG and each review-difference dimension. It is intended to be used together with the code at https://github.com/UKPLab/counter-review-logic
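To make the counterfactual idea concrete, the following is a minimal sketch of how one might check whether an injected logic flaw changes a reviewer's numeric score. It is not the authors' framework or the API of the linked repository: the function name `paired_permutation_test`, the choice of a sign-flip permutation test, and the example scores are all assumptions made for illustration.

```python
import random
from statistics import mean


def paired_permutation_test(original, counterfactual, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    `original[i]` and `counterfactual[i]` are hypothetical review scores for
    the same paper, without and with an injected logic flaw.
    Returns (mean difference, estimated p-value).
    """
    if len(original) != len(counterfactual):
        raise ValueError("score lists must be paired and of equal length")

    diffs = [o - c for o, c in zip(original, counterfactual)]
    observed = mean(diffs)

    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical scores for illustration only; not taken from the dataset.
original_scores = [6, 7, 5, 6, 8, 7]
counterfactual_scores = [6, 7, 5, 6, 8, 7]
effect, p = paired_permutation_test(original_scores, counterfactual_scores)
```

In this sketch, a reviewer that is insensitive to the injected flaw produces a mean difference near zero and a large p-value, which corresponds to the "no significant effect" finding described above. The released code may compare reviews along other dimensions than a single numeric score.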
License
Except where otherwise noted, this item is licensed under CC BY 4.0 (Attribution 4.0 International).

