Open Access

Wikipedia Text Segmentation

Date

2020-07-25

Description

For corpus generation, we extracted the top-level sections of featured articles and concatenated their textual contents into a plain-text corpus file. The content of a section is the concatenation of the text of its paragraph elements and the content of the sections it contains. Other elements, such as tables and image captions, are ignored when generating the text for a section, because text segmentation is meant to be applied to prose and not to pieces of information such as table fields. Furthermore, sections titled "See also", "References", or "External links" are skipped, as they do not contain content for which segmentation makes sense.
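The extraction rule described above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset: the `Paragraph`, `Other`, and `Section` classes, the function names, and the newline separator are all hypothetical, and applying the title filter to nested sections as well as top-level ones is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Section titles that are skipped, as listed in the description.
SKIPPED_TITLES = {"See also", "References", "External links"}


@dataclass
class Paragraph:
    """A prose paragraph (hypothetical stand-in for a parsed element)."""
    text: str


@dataclass
class Other:
    """Any non-prose element, e.g. a table or an image caption."""
    text: str


@dataclass
class Section:
    """A section with a title and an ordered list of child elements."""
    title: str
    children: List[Union[Paragraph, Other, "Section"]] = field(default_factory=list)


def section_text(section: Section) -> str:
    """Concatenate paragraph text and the text of contained sections."""
    parts = []
    for child in section.children:
        if isinstance(child, Paragraph):
            parts.append(child.text)
        elif isinstance(child, Section):
            # Assumption: the title filter also applies to nested sections.
            if child.title in SKIPPED_TITLES:
                continue
            nested = section_text(child)
            if nested:
                parts.append(nested)
        # Tables, captions, and other elements are ignored.
    return "\n".join(parts)  # the separator is an assumption


def corpus_text(articles: List[List[Section]]) -> str:
    """Build the corpus text from the top-level sections of each article."""
    texts = []
    for top_level_sections in articles:
        for section in top_level_sections:
            if section.title in SKIPPED_TITLES:
                continue
            text = section_text(section)
            if text:
                texts.append(text)
    return "\n".join(texts)


# Made-up example article, for illustration only.
example_article = [
    Section("History", [
        Paragraph("First paragraph."),
        Other("A table that is ignored."),
        Section("Early years", [Paragraph("Nested paragraph.")]),
    ]),
    Section("References", [Paragraph("Skipped entirely.")]),
]
example_corpus = corpus_text([example_article])
```

In this made-up example, `example_corpus` contains the two prose paragraphs of the "History" section in document order, while the table and the "References" section contribute nothing.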

License

Except where otherwise noted, this item's license is described as CC BY-SA 3.0