TUdatalib : Wikipedia Text Segmentation

dc.contributor.author	Martin, Marko
dc.contributor.author	Zesch, Torsten
dc.contributor.author	Erbs, Nicolai
dc.contributor.author	Gurevych, Iryna
dc.date.accessioned	2020-07-25T09:41:19Z
dc.date.available	2020-07-25T09:41:19Z
dc.date.issued	2012
dc.identifier.uri	https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2454
dc.description	For corpus generation, we extracted the top-level sections of featured articles and concatenated their textual contents into a plain-text corpus file. The content of a section is the concatenation of the text of its paragraph elements and the content of the sections it contains. Other elements, such as tables and image captions, are ignored when generating the text for a section, because text segmentation is meant to be applied to prose and not to pieces of information such as table fields. Furthermore, sections titled "See also", "References", or "External links" are skipped, as they do not contain text for which segmentation makes sense.	en_US
dc.language.iso	en	en_US
dc.rights	CC BY-SA 3.0
dc.rights.uri	https://creativecommons.org/licenses/by-sa/3.0/
dc.subject.classification	4.43-04 Artificial Intelligence and Machine Learning Methods	en_US
dc.subject.classification	4.43-05 Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
dc.subject.ddc	004
dc.title	Wikipedia Text Segmentation	en_US
dc.type	Dataset	en_US
tud.history.classification	Version=2016-2020;409-05 Interactive and Intelligent Systems, Image and Language Processing, Computer Graphics and Visualisation
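
The extraction procedure in dc.description can be illustrated with a minimal sketch. It is not the authors' implementation: the Paragraph, Table, and Section classes and the function names section_text and build_corpus are hypothetical, and the sketch assumes that an article is already available as a list of top-level sections, that paragraphs and contained sections are concatenated in document order, and that the title filter applies at every nesting level.

```python
from dataclasses import dataclass, field
from typing import List

# Section titles that the description says are skipped.
SKIPPED_TITLES = {"See also", "References", "External links"}


@dataclass
class Paragraph:  # hypothetical: a prose paragraph
    text: str


@dataclass
class Table:  # hypothetical: stands in for any non-prose element
    cells: List[str]


@dataclass
class Section:  # hypothetical: a section with mixed child elements
    title: str
    children: list = field(default_factory=list)


def section_text(section: Section) -> str:
    """Concatenate paragraph text and the text of contained sections."""
    parts = []
    for child in section.children:
        if isinstance(child, Paragraph):
            parts.append(child.text)
        elif isinstance(child, Section):
            # Assumption: skipped titles are filtered at every level.
            if child.title in SKIPPED_TITLES:
                continue
            nested = section_text(child)
            if nested:
                parts.append(nested)
        # Tables, image captions, and other elements are ignored.
    return "\n".join(parts)


def build_corpus(articles: List[List[Section]]) -> List[str]:
    """Return one text segment per retained top-level section."""
    segments = []
    for top_level_sections in articles:
        for section in top_level_sections:
            if section.title in SKIPPED_TITLES:
                continue
            text = section_text(section)
            if text:
                segments.append(text)
    return segments
```

Under these assumptions, each string returned by build_corpus corresponds to one top-level section, so the boundaries between consecutive strings would serve as the reference segment boundaries once the strings are written to a single corpus file.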

Files in this item

Name: WikipediaTextSegmentation.zip
Size: 10.97MB
Format: application/zip

This item appears in the following collection:

Text Segmentation [1]

Unless otherwise indicated, the license is described as: CC BY-SA 3.0