TUdatalib : Annotation Curricula to Implicitly Train Non-Expert Annotators

dc.contributor.author	Lee, Ji-Ung
dc.contributor.author	Klie, Jan-Christoph
dc.contributor.author	Gurevych, Iryna
dc.date.accessioned	2021-06-04T17:24:16Z
dc.date.available	2021-06-04T17:24:16Z
dc.date.issued	2021
dc.identifier.uri	https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2783
dc.description	Annotation studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain. This can be overwhelming in the beginning, mentally taxing, and introduce errors into the resulting annotations, especially in citizen science or crowdsourcing scenarios where domain expertise is not required and only annotation guidelines are provided. To alleviate these issues, we propose annotation curricula, a novel approach to implicitly train annotators. We gradually introduce annotators to the task by ordering the instances to be annotated according to a learning curriculum. To do so, we first formalize annotation curricula for sentence- and paragraph-level annotation tasks, define an ordering strategy, and identify well-performing heuristics and interactively trained models on three existing English datasets. We then conduct a user study with 40 voluntary participants who are asked to identify the most fitting misconception for English tweets about the Covid-19 pandemic. Our results show that using a simple heuristic to order instances can already significantly reduce the total annotation time while preserving a high annotation quality. Annotation curricula can thus provide a novel way to improve data collection. To facilitate future research, we further share our code and data consisting of 2,400 annotations.	en_US
dc.language.iso	en	en_US
dc.relation.isreferencedby	https://arxiv.org/abs/2106.02382
dc.rights	Creative Commons Attribution 4.0
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	NLP	en_US
dc.subject	Annotation Curriculum	en_US
dc.subject	Interactive Learning	en_US
dc.subject	Semantic Similarity	en_US
dc.subject.classification	4.43-06 Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics	en_US
dc.subject.ddc	004
dc.title	Annotation Curricula to Implicitly Train Non-Expert Annotators	en_US
dc.type	Dataset	en_US
dc.type	Text	en_US
tud.project	EU/EFRE | 20005482 | TexPrax - Gurevych	en_US
tud.project	DFG | GU798/21-1 | Infrastruktur für in	en_US
tud.unit	TUDa
tud.history.classification	Version=2020-2024;409-06 Information Systems, Process and Knowledge Management
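
The description above says that instances are ordered according to a learning curriculum using a simple heuristic. The following is a minimal sketch of that idea, not the authors' implementation: the record does not name the heuristic, so token count is used here as an assumed stand-in for difficulty, and the function names and example texts are hypothetical.

```python
from typing import Callable, List


def token_count(text: str) -> int:
    """Assumed difficulty heuristic: longer texts are treated as harder."""
    return len(text.split())


def order_by_curriculum(
    instances: List[str],
    difficulty: Callable[[str], float] = token_count,
) -> List[str]:
    """Return the instances sorted from easiest to hardest under `difficulty`."""
    # sorted() is stable, so instances with equal scores keep their input order.
    return sorted(instances, key=difficulty)


# Made-up placeholder texts, not taken from the dataset.
texts = [
    "A much longer tweet that makes several claims about the pandemic at once.",
    "A short tweet.",
    "A somewhat longer tweet about the pandemic.",
]
curriculum = order_by_curriculum(texts)
```

Any other scoring function with the same signature can be passed as `difficulty`, for example a readability score or a model-based estimate.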

Files in this item

Name: license_CC-BY-4.0.rdf
Size: 7.877KB
Format: application/rdf+xml

Name: data.zip
Size: 48.71KB
Format: application/zip

This item appears in the following collection:

Semantic Relatedness [5]

Except where otherwise noted, this item's license is described as Creative Commons Attribution 4.0