
dc.contributor.author: Klie, Jan-Christoph
dc.contributor.author: Eckart de Castilho, Richard
dc.contributor.author: Gurevych, Iryna
dc.date.accessioned: 2023-09-07T21:45:11Z
dc.date.available: 2023-09-07T21:45:11Z
dc.date.issued: 2023-09-07
dc.identifier.uri: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/3939
dc.identifier.uri: https://doi.org/10.48328/tudatalib-1220
dc.description: This is the accompanying data for the paper "Analyzing Dataset Annotation Quality Management in the Wild". Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models and for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, bias, or annotation artifacts. Best practices and guidelines for annotation projects exist, but to the best of our knowledge, no large-scale analysis has yet been performed on how quality management is actually conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions on how to apply them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. We find that a majority of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works as only subpar. Our analysis also shows common errors, especially in the use of inter-annotator agreement and the computation of annotation error rates.
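The description above highlights misuse of inter-annotator agreement as a common error. As an illustrative aside (not code from the dataset or the paper), one standard chance-corrected agreement measure for two annotators is Cohen's kappa; the sketch below computes it from scratch, with hypothetical function and label names:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items.

    a, b: equal-length sequences of category labels.
    Returns (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected by chance.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy example: two annotators, binary sentiment labels.
kappa = cohens_kappa(["pos", "pos", "neg", "neg"],
                     ["pos", "neg", "neg", "neg"])
# Observed agreement 0.75, chance agreement 0.5, so kappa = 0.5.
```

Kappa above roughly 0.8 is often read as strong agreement, but the paper's point is that such thresholds are frequently applied without regard to label distribution or task difficulty.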
dc.relation: IsSupplementTo; arXiv; 2307.08153
dc.relation: IsReferencedBy; URL; https://github.com/UKPLab/arxiv2023-qanno
dc.rights: Creative Commons Attribution-NonCommercial 4.0
dc.rights.uri: https://creativecommons.org/licenses/by-nc/4.0/
dc.subject: annotation
dc.subject: quality management
dc.subject: nlp
dc.subject.classification: 4.43-04 Künstliche Intelligenz und Maschinelle Lernverfahren
dc.subject.classification: 4.43-05 Bild- und Sprachverarbeitung, Computergraphik und Visualisierung, Human Computer Interaction, Ubiquitous und Wearable Computing
dc.subject.ddc: 004
dc.title: Analyzing Dataset Annotation Quality Management in the Wild
dc.type: Dataset
tud.unit: TUDa
tud.history.classification: Version=2016-2020; 409-05 Interaktive und intelligente Systeme, Bild- und Sprachverarbeitung, Computergraphik und Visualisierung


Files in this item


This dataset appears in:

