
dc.contributor.author: Martin, Marko
dc.contributor.author: Zesch, Torsten
dc.contributor.author: Erbs, Nicolai
dc.contributor.author: Gurevych, Iryna
dc.date.accessioned: 2020-07-25T09:41:19Z
dc.date.available: 2020-07-25T09:41:19Z
dc.date.issued: 2012
dc.identifier.uri: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2454
dc.description: For corpus generation, we extracted the top-level sections of featured articles and concatenated their textual contents into a plain-text corpus file. The content of a section is the concatenation of the text of its paragraph elements and the content of the sections it contains. Other elements, such as tables and image captions, are ignored when generating the text for a section, because text segmentation is meant to be applied to prose and not to pieces of information such as table fields. Furthermore, sections titled "See also", "References", or "External links" are skipped, as they contain no content for which segmentation makes sense. [en_US]
dc.language.iso: en [en_US]
dc.rights: CC BY-SA 3.0
dc.rights.uri: https://creativecommons.org/licenses/by-sa/3.0/
dc.subject.classification: 409-05 Interactive and Intelligent Systems, Image and Language Processing, Computer Graphics and Visualisation [en_US]
dc.subject.ddc: 004
dc.title: Wikipedia Text Segmentation [en_US]
dc.type: Dataset [en_US]
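
The extraction rule given in dc.description can be illustrated with a short Python sketch. It is not the authors' code: the classes `Paragraph`, `Table`, `ImageCaption`, and `Section`, the functions `section_text` and `build_corpus`, and the newline separator are all hypothetical names and choices made for illustration. It also assumes that sections with an excluded title are skipped at any nesting depth, which the description does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Section titles the description says are skipped.
SKIPPED_TITLES = {"See also", "References", "External links"}


@dataclass
class Paragraph:
    text: str


@dataclass
class Table:
    cells: List[str]


@dataclass
class ImageCaption:
    text: str


@dataclass
class Section:
    title: str
    elements: List[Union[Paragraph, Table, ImageCaption, "Section"]] = field(
        default_factory=list
    )


def section_text(section: Section) -> str:
    """Concatenate paragraph text and the text of nested sections."""
    parts = []
    for element in section.elements:
        if isinstance(element, Paragraph):
            parts.append(element.text)
        elif isinstance(element, Section):
            # Assumption: excluded titles are skipped at any depth.
            if element.title in SKIPPED_TITLES:
                continue
            nested = section_text(element)
            if nested:
                parts.append(nested)
        # Tables, image captions, and other elements are ignored.
    return "\n".join(parts)


def build_corpus(top_level_sections: List[Section]) -> str:
    """Join the text of all non-skipped top-level sections (newline is an assumed separator)."""
    texts = [
        section_text(s) for s in top_level_sections if s.title not in SKIPPED_TITLES
    ]
    return "\n".join(t for t in texts if t)
```

For example, a top-level section holding a paragraph, a table, and a nested section with one paragraph would contribute only the two paragraph texts, and a sibling "References" section would contribute nothing.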



Except where otherwise noted, this item's license is described as CC BY-SA 3.0