TUdatalib: Wikipedia Text Segmentation

Number of files: 1

WikipediaTextSegmentation.zip (10.97MB)

Date

2012

Persons

Types

Dataset

Description

To generate the corpus, we extracted the top-level sections of featured Wikipedia articles and concatenated their textual contents into a plain-text corpus file. The content of a section is the concatenation of the text of its paragraph elements and the content of the sections it contains. Other elements, such as tables and image captions, are ignored when generating the text for a section, because text segmentation is meant to be applied to prose and not to pieces of information such as table fields. Furthermore, sections titled "See also", "References", or "External links" are skipped, as they contain no content for which segmentation makes sense.
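The extraction rule described above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset: the `Paragraph`, `Table`, and `Section` classes, the `section_text` and `corpus_segments` functions, and the newline separator are all hypothetical and chosen for illustration only. The sketch also assumes that the title filter applies to top-level sections, since the description does not say whether nested sections with these titles are skipped as well.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Section titles named in the description as skipped.
SKIPPED_TITLES = {"See also", "References", "External links"}


@dataclass
class Paragraph:
    """Hypothetical prose element; its text is kept."""
    text: str


@dataclass
class Table:
    """Hypothetical non-prose element; stands in for tables, captions, etc."""
    cells: List[str]


@dataclass
class Section:
    """Hypothetical section node with an ordered list of child elements."""
    title: str
    children: List[Union[Paragraph, Table, "Section"]] = field(default_factory=list)


def section_text(section: Section) -> str:
    """Concatenate paragraph text and the text of contained sections.

    Any other child element (here: Table) is ignored.
    The newline separator is an assumption, not taken from the dataset.
    """
    parts = []
    for child in section.children:
        if isinstance(child, Paragraph):
            parts.append(child.text)
        elif isinstance(child, Section):
            nested = section_text(child)
            if nested:
                parts.append(nested)
        # Tables and other non-prose elements fall through and are dropped.
    return "\n".join(parts)


def corpus_segments(top_level_sections: List[Section]) -> List[str]:
    """Return one text segment per top-level section, skipping filtered titles."""
    return [
        section_text(section)
        for section in top_level_sections
        if section.title not in SKIPPED_TITLES
    ]


# Small made-up article used only to exercise the functions above.
article = [
    Section("History", [
        Paragraph("A."),
        Table(["x", "y"]),
        Section("Early years", [Paragraph("B.")]),
    ]),
    Section("References", [Paragraph("R.")]),
]
segments = corpus_segments(article)
```

Each entry of `segments` corresponds to one top-level section, so the boundaries between entries are the segment boundaries a text segmentation method would be expected to recover.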

DFG subject areas

4.43-04 Artificial Intelligence and Machine Learning Methods
4.43-05 Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing

URI

https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2454

Collections

Text Segmentation [1]

The following license terms are associated with this resource:

Unless otherwise indicated, the license is described as follows: CC BY-SA 3.0