Wikipedia Text Segmentation

dc.contributor.author Martin, Marko
dc.contributor.author Zesch, Torsten
dc.contributor.author Erbs, Nicolai
dc.contributor.author Gurevych, Iryna
dc.date.accessioned 2020-07-25T09:41:19Z
dc.date.available 2020-07-25T09:41:19Z
dc.date.created 2012
dc.date.issued 2020-07-25
dc.description For corpus generation, we extracted the top-level sections of featured articles and concatenated their textual contents into a plain-text corpus file. The content of a section is the concatenation of the text of its paragraph elements and the content of the sections it contains. Other elements, such as tables and image captions, are ignored when generating the text for a section, because text segmentation is meant to be applied to prose and not to pieces of information such as table fields. Furthermore, sections titled "See also", "References", or "External links" are skipped, as they contain no content for which segmentation makes sense. en_US
dc.identifier.uri https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2454
dc.language.iso en en_US
dc.rights CC BY-SA 3.0
dc.rights.uri https://creativecommons.org/licenses/by-sa/3.0/
dc.subject.classification 4.43-04
dc.subject.classification 4.43-05
dc.subject.ddc 004
dc.title Wikipedia Text Segmentation en_US
dc.type Dataset en_US
dcterms.accessRights openAccess
tuda.history.classification Version=2016-2020;409-05 Interaktive und intelligente Systeme, Bild- und Sprachverarbeitung, Computergraphik und Visualisierung
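
The extraction procedure given in dc.description can be sketched roughly as follows. This is a minimal illustration and not the authors' actual tooling: the Paragraph, OtherElement, and Section types, the function names, and the newline separator are all assumptions introduced here, and applying the title filter to nested sections as well as top-level ones is one possible reading of the description.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Section titles that the description says are skipped.
SKIPPED_TITLES = {"See also", "References", "External links"}


@dataclass
class Paragraph:
    """A prose paragraph (hypothetical type)."""
    text: str


@dataclass
class OtherElement:
    """Stand-in for tables, image captions, etc. (hypothetical type)."""
    kind: str
    text: str = ""


@dataclass
class Section:
    """A section with its children in document order (hypothetical type)."""
    title: str
    children: List[Union["Paragraph", "OtherElement", "Section"]] = field(
        default_factory=list
    )


def section_text(section: Section) -> str:
    """Concatenate paragraph text and the content of contained sections."""
    parts = []
    for child in section.children:
        if isinstance(child, Paragraph):
            parts.append(child.text)
        elif isinstance(child, Section):
            # Assumption: the title filter also applies to nested sections.
            if child.title in SKIPPED_TITLES:
                continue
            nested = section_text(child)
            if nested:
                parts.append(nested)
        # Any other element (table, caption, ...) is ignored.
    # Assumption: pieces are joined with newlines; the record does not say.
    return "\n".join(parts)


def corpus_segments(top_level_sections: List[Section]) -> List[str]:
    """Return one text segment per top-level section that is not skipped."""
    return [
        section_text(s)
        for s in top_level_sections
        if s.title not in SKIPPED_TITLES
    ]


# Invented example article, only to exercise the functions above.
article = [
    Section("History", [
        Paragraph("Founded in 1900."),
        OtherElement("table", "year | event"),
        Section("Early years", [Paragraph("It grew quickly.")]),
    ]),
    Section("References", [Paragraph("[1] Some source.")]),
]
segments = corpus_segments(article)
print(segments)
```

Each returned string corresponds to one top-level section, so the boundaries between consecutive strings are the segment boundaries a text segmentation system would be expected to recover. How the published corpus file actually marks those boundaries is not stated in this record.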

Files

Original bundle

Name                           Description  Size      Format
WikipediaTextSegmentation.zip               10.98 MB  ZIP archive