TUdatalib : Football Coreference Corpus

dc.contributor.author	Bugert, Michael
dc.contributor.author	Reimers, Nils
dc.contributor.author	Barhom, Shany
dc.contributor.author	Dagan, Ido
dc.contributor.author	Gurevych, Iryna
dc.date.accessioned	2020-03-19T17:24:31Z
dc.date.available	2020-03-19T17:24:31Z
dc.date.issued	2020
dc.identifier.uri	https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2305
dc.description	This script generates: 1. the original sentence-level Football Coreference Corpus (FCC), 2. a version of the sentence-level FCC that was cleaned and updated after manual review, 3. FCC-T, the extended version of the Football Coreference Corpus with reannotated token-level spans, and 4. publication date annotations for the ECB+ corpus. The script downloads the original documents from archive.org's Wayback Machine, cleans and processes them locally on your machine, and combines the result with our annotations. See README.md for instructions. Details on the annotations and corpora: for the original FCC, see Bugert et al. 2020, "Breaking the Subtopic Barrier in Cross-Document Event Coreference Resolution" (http://ceur-ws.org/Vol-2593/paper3.pdf); for the token-level reannotation FCC-T, see Bugert et al. 2020, "Cross-Document Event Coreference Resolution Beyond Corpus-Tailored Systems" (https://arxiv.org/abs/2011.12249). In case of trouble with this downloader, please get in touch on GitHub: https://github.com/UKPLab/cdcr-beyond-corpus-tailored/issues Cross-document event coreference resolution (CDCR) is the task of detecting and clustering mentions of events across a set of documents. A major bottleneck in CDCR is the lack of appropriate datasets, which stems from the difficulty of annotating data for this task. We present the first scalable approach for annotating cross-subtopic event coreference links, a highly valuable but rarely occurring type of cross-document link. Annotating these links requires combing through hundreds of documents, an endeavor for which conventional token-level annotation schemes with trained expert annotators are too expensive. We instead propose crowdsourcing annotation at the sentence level to achieve scalability. We apply our approach to create the Football Coreference Corpus (FCC), a corpus of 451 sports news reports, while reaching high agreement between NLP experts and crowd annotators in the process.	en_US
dc.language.iso	en	en_US
dc.rights	Creative Commons Attribution Share-Alike 4.0
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.subject.classification	4.43-04 Artificial Intelligence and Machine Learning Methods	en_US
dc.subject.classification	4.43-05 Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
dc.subject.ddc	004
dc.title	Football Coreference Corpus	en_US
dc.type	Dataset	en_US
dc.type	Text	en_US
tud.history.classification	Version=2016-2020;409-05 Interactive and Intelligent Systems, Image and Language Processing, Computer Graphics and Visualisation

Files in this item

Name: FCC_FCC-T_ECBp_dates_2022-05-25.zip
Size: 910.7 KB
Format: application/zip

Name: license_CC-BY-SA-4.0.rdf
Size: 9.723 KB
Format: application/rdf+xml

This dataset appears in:

Coreference Resolution [2]

Except where otherwise noted, the license is described as: Creative Commons Attribution Share-Alike 4.0