Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework

Dycke, Nils; Gurevych, Iryna

dc.contributor.author	Dycke, Nils
dc.contributor.author	Gurevych, Iryna
dc.date.accessioned	2025-09-16T14:18:20Z
dc.date.created	2025-07
dc.date.issued	2025-09-16
dc.description	Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper’s results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly. This dataset includes the papers, the extracted research logic for each paper, and all predictions and evaluations per ARG and review difference dimensions. Use it together with the code provided here: https://github.com/UKPLab/counter-review-logic
dc.description.version	0.1
dc.identifier.uri	https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4802
dc.language.iso	en
dc.rights.license	CC-BY-4.0 (https://creativecommons.org/licenses/by/4.0)
dc.subject	Peer review, counterfactual evaluation, automatic reviewer, LLM
dc.subject.classification	4.43-04
dc.subject.ddc	004
dc.title	Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
dc.type	Text
dcterms.accessRights	restrictedAccess
tuda.agreements	true
tuda.unit	TUDa

Files

Original bundle

Name	Description	Size	Format
papers_v0.1.zip	The papers of the dataset, with split information and venue configuration (e.g. review template).	376.97 MB	ZIP archive
blueprints_v0.2.zip	Research logic (aka "blueprint") per paper.	931.79 KB	ZIP archive
counter-review-cfs_v0.1.zip	The actual counterfactuals used for evaluation. Download only this file if you want to test your own ARG.	32.25 MB	ZIP archive
counter-review-replicate_v0.1.zip	All files needed to replicate the exact experiments in the paper, including all intermediate steps.	231.38 MB	ZIP archive
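
The internal layout of these archives is not described on this page, so it may help to inspect a downloaded archive before unpacking it. The following is a minimal sketch using only the Python standard library; the helper name summarize_archive is hypothetical and is not part of the released code, and the sketch makes no assumption about the folder structure inside the archive.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Return (file_count, total_uncompressed_bytes, Counter of file suffixes).

    Hypothetical helper for illustration; it only reads the archive index
    and does not extract anything to disk.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries so that only real files are counted.
        files = [info for info in archive.infolist() if not info.is_dir()]
    suffixes = Counter(
        PurePosixPath(info.filename).suffix or "(none)" for info in files
    )
    total_bytes = sum(info.file_size for info in files)
    return len(files), total_bytes, suffixes
```

For example, assuming counter-review-cfs_v0.1.zip has been downloaded into the working directory, summarize_archive("counter-review-cfs_v0.1.zip") returns the number of files, their total uncompressed size in bytes, and a count per file extension.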
