Wikipedia Text Segmentation

dc.contributor.author	Martin, Marko
dc.contributor.author	Zesch, Torsten
dc.contributor.author	Erbs, Nicolai
dc.contributor.author	Gurevych, Iryna
dc.date.accessioned	2020-07-25T09:41:19Z
dc.date.available	2020-07-25T09:41:19Z
dc.date.created	2012
dc.date.issued	2020-07-25
dc.description	For corpus generation, we extracted the top-level sections of featured articles and concatenated their textual contents into a plain-text corpus file. The content of a section is the concatenation of the text of its paragraph elements and the content of its nested sections. Other elements, such as tables and image captions, are ignored when generating the text for a section, because text segmentation is meant to be applied to prose and not to pieces of information such as table fields. Furthermore, sections titled "See also", "References", or "External links" are skipped, as they contain no content for which segmentation makes sense.	en_US
dc.identifier.uri	https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2454
dc.language.iso	en	en_US
dc.rights	CC BY-SA 3.0
dc.rights.license	other
dc.rights.uri	https://creativecommons.org/licenses/by-sa/3.0/
dc.subject.classification	4.43-04
dc.subject.classification	4.43-05
dc.subject.ddc	004
dc.title	Wikipedia Text Segmentation	en_US
dc.type	Dataset	en_US
dcterms.accessRights	openAccess
tuda.history.classification	Version=2016-2020;409-05 Interactive and Intelligent Systems, Image and Language Processing, Computer Graphics and Visualisation
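
The extraction rule given in dc.description can be illustrated with a minimal Python sketch. It is not the authors' code: the Paragraph, Table, and Section classes, the function names, and the newline separator are all hypothetical, and it assumes the title filter is applied to top-level sections only, which the description does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Section titles the description says are skipped.
SKIPPED_TITLES = {"See also", "References", "External links"}


@dataclass
class Paragraph:
    """Hypothetical stand-in for a prose paragraph element."""
    text: str


@dataclass
class Table:
    """Hypothetical stand-in for a non-prose element (table, caption, ...)."""
    cells: List[str] = field(default_factory=list)


@dataclass
class Section:
    """Hypothetical stand-in for an article section with nested content."""
    title: str
    children: List[Union[Paragraph, Table, "Section"]] = field(default_factory=list)


def section_text(section: Section) -> str:
    """Concatenate paragraph text and nested-section text, in document order."""
    parts = []
    for child in section.children:
        if isinstance(child, Paragraph):
            parts.append(child.text)
        elif isinstance(child, Section):
            nested = section_text(child)
            if nested:
                parts.append(nested)
        # Any other element (e.g. Table) is ignored.
    return "\n".join(parts)


def corpus_segments(top_level_sections: List[Section]) -> List[str]:
    """Return one text segment per top-level section that is not skipped."""
    return [
        section_text(section)
        for section in top_level_sections
        if section.title not in SKIPPED_TITLES
    ]


# Usage with a made-up article structure.
article = [
    Section("History", [
        Paragraph("A."),
        Table(["x"]),
        Section("Early", [Paragraph("B.")]),
    ]),
    Section("References", [Paragraph("R.")]),
]
segments = corpus_segments(article)  # ["A.\nB."]
```

Each returned string corresponds to one top-level section, so the boundaries between consecutive strings are the segment boundaries a segmentation algorithm would be expected to recover.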

Files

Original bundle

Name	Description	Size	Format
WikipediaTextSegmentation.zip		10.98 MB	ZIP archive

Collections

Text Segmentation