The EUR-Lex dataset

Eneldo Loza Mencía
Johannes Fürnkranz


The EUR-Lex text collection is a collection of documents about European Union law. It contains several different types of documents, including treaties, legislation, case-law and legislative proposals, which are indexed according to several orthogonal categorization schemes to allow for multiple search facilities. The most important categorization is provided by the EUROVOC descriptors, a topic hierarchy with almost 4,000 categories regarding different aspects of European law.

This document collection provides an excellent opportunity to study text classification techniques for several reasons:

The dataset poses a very challenging multilabel scenario due to the high number of possible labels (up to 4,000). A first step towards analyzing this dataset was taken by applying multilabel classification techniques to three of its categorization schemes in the following work:

Eneldo Loza Mencía and Johannes Fürnkranz.
Efficient multilabel classification algorithms for large-scale problems in the legal domain.
In Semantic Processing of Legal Texts. Springer. To appear.

Previous shorter versions of this work used a version of the dataset with slightly more documents  [2,3].

If you just want to download the dataset, go straight ahead to the files section.

EUR-Lex Repository

The EUR-Lex/CELEX (Communitatis Europeae LEX) site provides a freely accessible repository of European Union law texts. The documents include the Official Journal of the European Union, treaties, international agreements, legislation in force, legislation in preparation, case-law and parliamentary questions. They are available in most of the languages of the EU, in HTML and PDF format. We retrieved the HTML versions with bibliographic notes recursively from all (non-empty) documents in the English version of the Directory of Community legislation in force, in total 19,348 documents. Only documents related to secondary law (in contrast to primary law, the constitutional treaties of the European Union) and international agreements are included in this repository. The legal forms of the included acts are mostly decisions (8,917 documents), regulations (5,706), directives (1,898) and agreements (1,597).

The bibliographic notes of the documents contain information such as dates of effect and validity, authors, relationships to other documents, and classifications. The classifications include the assignment to several EUROVOC descriptors, directory codes and subject matters; hence all classifications are multilabel. EUROVOC is a multilingual thesaurus providing a controlled vocabulary for European institutions; documents in the documentation systems of the EU are indexed using this thesaurus. The directory codes are the classes of the official classification hierarchy of the Directory of Community legislation in force. It contains 20 chapter headings with up to four sub-division levels.

A total of 3,956 different EUROVOC descriptors were identified in the retrieved documents; each document is associated with 5.31 descriptors on average. In contrast, there are only 201 different subject matters appearing in the dataset, with a mean of 2.21 labels per document, and 410 different directory codes, with an average label set size of 1.29. Note that for the directory codes we used only the assignment to the leaf category, as the parent nodes can be deduced from the leaf node assignment.

The Data

The EUR-Lex dataset was retrieved, processed, prepared and used in the following way:
  1. Retrieval

    A total of 19,940 documents was retrieved from the EUR-Lex/CELEX (Communitatis Europeae LEX) site on July 7, 2006. The list of documents is contained in the download script file (change the extension to ".bat" for Windows); an archive containing the retrieved HTML files is also available (see the summary of available files below).

  2. Selection

    The retrieved documents were parsed in order to extract category assignments and the body text. An internal ID was given to each document retrieved in this way, starting from one and enumerating the documents according to the alphabetic ordering of their file names.

    A number of documents were excluded: 214 contained an error message, e.g. that the document was not available in English (189); 316 contained an empty text field; 50 did not contain any relevant category information; and 12 contained corrigenda which (except for one document) referred to non-English translations and were hence not available in English. The remaining number of documents is therefore 19,348.

    A subset used in early experiments still contained 19,596 documents [2,3]; 248 documents were additionally removed after a review of the dataset [5,6]. Removed documents were still assigned an ID, so the numbering of the document IDs in the dataset is not continuous.

    A mapping of document ID to file name and CELEX ID, together with a column indicating whether the document was removed, is available in eurlex_ID_mappings.csv.gz. An explanation of the CELEX IDs and further useful information can be found on the EUR-Lex site. The log file of the extraction and selection process is named eurlex_processing.log.gz. Note that two documents (no. 244 and 2736) were excluded manually.

  3. Preprocessing

    For the following steps the library of the Apache Lucene Project (Version 2.1.0) was used: the text was transformed to lower case (LowerCaseTokenizer), stop words from a standard English stop word list were removed (StopFilter), and finally the Porter stemmer algorithm was applied (PorterStemFilter). 
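    The original pipeline used Lucene's analyzers; the following is a rough Python approximation of the first two steps (lower-casing tokenization and stop word filtering), with a hard-coded stop word sample standing in for the english.stop list. The final Porter stemming step (PorterStemFilter) is omitted here and would require a stemmer implementation.

```python
import re

# A few illustrative entries; the real list is loaded from english.stop.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "is", "for", "on"}

def preprocess(text):
    """Approximation of the Lucene pipeline described above:
    LowerCaseTokenizer (keep runs of letters, lower-cased),
    then StopFilter. Porter stemming would follow."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The Council of the European Union adopted the Regulation"))
```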

    The file eurlex_tokenstring.arff.gz contains the token strings in the ARFF format of Weka. The first attribute is the document ID; the second is of type String and contains the converted list of tokens.
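    Because the file has only these two attributes, a full ARFF parser is not needed to read it. The following sketch reads such a two-column layout (the quoting convention and sample data are assumptions):

```python
import csv

def read_tokenstring_arff(lines):
    """Yield (doc_id, token_list) pairs from the simple two-attribute
    ARFF layout described above: a numeric ID and a quoted token string."""
    in_data = False
    for line in lines:
        line = line.strip()
        if not in_data:
            if line.lower() == "@data":
                in_data = True
            continue
        if not line:
            continue
        # each data row looks like: 42,'token token token ...'
        doc_id, tokens = next(csv.reader([line], quotechar="'"))
        yield int(doc_id), tokens.split()

sample = ["@relation eurlex", "@attribute id numeric",
          "@attribute text string", "@data",
          "42,'council regul oil fat'"]
for doc_id, tokens in read_tokenstring_arff(sample):
    print(doc_id, tokens)
```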

  4. Class Assignments

    The class assignments are saved in the TREC qrels format, as is done for the Reuters RCV1v2 collection. Each line encodes a document-class assignment in the format "<class_identifier> <document_ID> 1". As an example, the following listing shows document number 13286 belonging to the classes "Oils and Fats" and "Commercial Policy" and document 13287 assigned to "Agriculture" and again "Oils and Fats".

    oils_and_fats 13286 1
    commercial_policy 13286 1
    agriculture 13287 1
    oils_and_fats 13287 1
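    A file in this format can be turned into per-document label sets with a few lines; the following sketch collects the assignments into a dictionary keyed by document ID (function name and in-memory representation are our own choices):

```python
from collections import defaultdict

def read_qrels(lines):
    """Parse lines of the form '<class_identifier> <document_ID> 1'
    into a mapping: document ID -> set of class identifiers."""
    labels = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        cls, doc_id, flag = parts
        if flag == "1":
            labels[int(doc_id)].add(cls)
    return labels

sample = ["oils_and_fats 13286 1", "commercial_policy 13286 1",
          "agriculture 13287 1", "oils_and_fats 13287 1"]
labels = read_qrels(sample)
print(sorted(labels[13286]))  # ['commercial_policy', 'oils_and_fats']
```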

    In total, three categorization types were extracted: the subject matter, directory code and EUROVOC descriptor assignment. Directory code classes are organized in a hierarchy of four levels. When a document is assigned to a class, it is also linked to all the parent classes. E.g., document 13286 belongs to "Oils and fats" and therefore also to the parent classes "Products subject to market organisation" and "Agriculture". A ".00" in the numeric code denotes the root node of this level, i.e. the parent node. Since the parent nodes can be deduced from the child nodes, we commonly only take into consideration the deepest class. The class assignments are contained in the following files:
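    Recovering the implicit parent classes from a leaf directory code can be sketched as follows, assuming a dot-separated numeric code format in which ".00" marks unused sub-division levels (the concrete example code "03.60.59.00" is illustrative, not taken from the dataset):

```python
def parents(code):
    """Return the parent directory codes of a leaf code, shallowest
    first, under the assumed format 'cc.ss.ss.ss' where trailing '00'
    parts denote unused sub-division levels."""
    parts = code.split(".")
    width = len(parts)
    # strip trailing ".00" placeholders to find the effective depth
    while parts and parts[-1] == "00":
        parts.pop()
    result = []
    for depth in range(1, len(parts)):
        prefix = parts[:depth] + ["00"] * (width - depth)
        result.append(".".join(prefix))
    return result

print(parents("03.60.59.00"))  # ['03.00.00.00', '03.60.00.00']
```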

    In addition, different views on the Directory Code hierarchy were constructed:

    Note that these files may contain mappings to documents that were excluded from the final document set due to empty or incomplete text fields. This has to be taken into account especially when computing statistics.

  5. Cross Validation

    The instances of eurlex_tokenstring.arff.gz were split randomly into 10 folds in order to perform cross validation. The resulting 10 train/test splits are available (see the summary of available files below).
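    The splitting procedure can be sketched as below; the seed and exact shuffling used for the published splits are not reproduced here, so this does not recreate the distributed folds.

```python
import random

def make_folds(ids, k=10, seed=0):
    """Shuffle the instance IDs and deal them round-robin into k folds."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

folds = make_folds(range(19348))
# train/test pair j: fold j is the test set, the other nine folds
# together form the training set
test_0 = folds[0]
train_0 = [d for j, fold in enumerate(folds) if j != 0 for d in fold]
print(len(test_0), len(train_0))  # 1935 17413
```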

  6. Feature Vector Representation

    Each train/test set pair was separately converted into the TF-IDF feature representation. In order to ensure that no information from the test set is included in the training set, the IDF statistics were only computed on the training sets. The conversion was done with the StringToWordVector class of Weka 3.5.5.  Afterwards, the attributes were ordered according to their document frequency. The following command was used:

    for (( i=1; i <= 10; i++ )); do
      java -Xmx1000M -cp weka.jar weka.filters.unsupervised.attribute.StringToWordVector -R 2 -W 999999 -C -I -N 0 -i eurlex_tokenstring_CV${i}-10_train.arff -o temp_eurlex_CV${i}-10_train.arff -b -r eurlex_tokenstring_CV${i}-10_test.arff -s temp_eurlex_CV${i}-10_test.arff
      java -Xmx1000M -cp weka.jar:. weka.filters.supervised.attribute.AttributeSelection -E DocumentFrequencyAttributeEval -S "weka.attributeSelection.Ranker -T 0.0 -N -1" -i temp_eurlex_CV${i}-10_train.arff -o eurlex_CV${i}-10_train.arff -b -r temp_eurlex_CV${i}-10_test.arff -s eurlex_CV${i}-10_test.arff
    done
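    The train-only IDF idea can be sketched as follows: document frequencies are counted on the training split alone and then applied to both splits, so no statistics leak from the test set. Weka's StringToWordVector has its own TF-IDF variant, so the exact weighting below is an illustrative assumption.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Compute IDF weights from the training documents only."""
    df = Counter()
    for tokens in train_docs:
        df.update(set(tokens))  # document frequency, not term frequency
    n = len(train_docs)
    return {t: math.log(n / df_t) for t, df_t in df.items()}

def tfidf(tokens, idf):
    """Weight a (train or test) document with the fitted IDF table;
    terms unseen in the training split carry no weight and are dropped."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

train = [["oil", "fat"], ["oil", "regul"], ["fat", "tax"]]
idf = fit_idf(train)
print(tfidf(["oil", "oil", "unseen"], idf))
```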

    We wrote an attribute evaluation class for Weka's AttributeSelection filter in order to sort the attributes by their document frequency. The source code and the compiled class file DocumentFrequencyAttributeEval.class are available, and the resulting data files are listed in the summary of available files below.

  7. Feature Selection

    We used feature selection in our experiments in order to reduce the computational costs, especially the memory consumption. A version of the dataset with only the first 5000 features with respect to their document frequency is provided (see the summary of available files below) and was obtained by executing the following script:

    for (( i=1; i <= 10; i++ )); do
      java -Xmx1000M -cp weka.jar weka.filters.unsupervised.attribute.Remove -V -R first-5001 -i eurlex_CV${i}-10_train.arff -o eurlex_nA-5k_CV${i}-10_train.arff
      java -Xmx1000M -cp weka.jar weka.filters.unsupervised.attribute.Remove -V -R first-5001 -i eurlex_CV${i}-10_test.arff -o eurlex_nA-5k_CV${i}-10_test.arff
    done

    Please refer to this dataset if you would like to directly compare to our results.

  8. Mulan

    Mulan is a library for multilabel classification developed by the Machine Learning and Knowledge Discovery Group at the Aristotle University of Thessaloniki. It uses a different data format, also based on ARFF: the label information is coded as binary attributes at the end of the attribute list. We do not provide the data directly in this format since this would result in too many different combinations. Instead, we provide a small Java (1.5) program to convert the ARFF files of this dataset into the Mulan format. The command line of the program (Convert2MulanArff.class) is as follows:

    java -mx1000M -cp weka.jar:. Convert2MulanArff inputMLdata.arff id2class_mapping.qrels outputMLdata.arff
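    The conversion performed by Convert2MulanArff can be illustrated with the following toy sketch, which works on in-memory rows rather than ARFF files: for each document, one binary attribute per label is appended after the feature values (all names and data here are illustrative):

```python
def to_mulan_rows(rows, doc_labels, label_names):
    """Append one 0/1 attribute per label to each feature row,
    mirroring the Mulan convention of binary label attributes at
    the end of the attribute list."""
    converted = []
    for doc_id, features in rows:
        bits = [1 if name in doc_labels.get(doc_id, set()) else 0
                for name in label_names]
        converted.append(features + bits)
    return converted

rows = [(13286, [0.5, 0.1]), (13287, [0.0, 0.9])]
doc_labels = {13286: {"oils_and_fats", "commercial_policy"},
              13287: {"agriculture", "oils_and_fats"}}
label_names = ["agriculture", "commercial_policy", "oils_and_fats"]
print(to_mulan_rows(rows, doc_labels, label_names))
```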
  9. Summary of available files

     117K  Download script for the source documents of the EUR-Lex dataset.
     161M  Source documents in HTML format.
     english.stop  3.6K  Standard English stop word list.
     eurlex_processing.log.gz  1.3M  Log file of the processing of the source documents.
     eurlex_ID_mappings.csv.gz  157K  Table mapping file name, CELEX ID and internal document ID, including whether a document was excluded.
     1.6M  Mappings of documents to the different categorizations. You will need this for experimentation.
     eurlex_tokenstring.arff.gz  36M  The preprocessed (essentially stemmed) documents in ARFF token list format. Use this for further preprocessing, e.g. for different cross validation splits.
     382M  Ten-fold cross validation training and test splits used in the experiments, in token list format. Use this e.g. for different term weighting or feature vector representation computations.
     DocumentFrequencyAttributeEval.class  4.3K  Program that orders attributes according to their document frequency.
     7.3K  Source code of DocumentFrequencyAttributeEval.class.
     269M  Cross validation splits of the TF-IDF representation of the documents. Use this e.g. for a different feature selection.
     218M  Cross validation splits of the TF-IDF representation with the 5000 most frequent features selected, as used in the experiments. Use this for a direct comparison.
     Convert2MulanArff.class  5.3K  Converts the indexed ARFF format into the Mulan ARFF format.
     4.6K  Source code of Convert2MulanArff.class.

  10. Terms of use

    Concerning the original EUR-Lex documents and all direct derivatives containing text or other information from these documents, the data is as freely available as determined by the copyright notice of the EUR-Lex site. Additional data provided by the authors on this site is freely available. Nevertheless, we would be glad if you cited this site or [6] if you use the EUR-Lex dataset in any way. We would also be pleased to list your work in the references section.


The EUR-Lex dataset was analyzed and used in the following publications, partially using a previous, slightly different version. Download links can be found on the publications site of our group.

Eneldo Loza Mencía and Johannes Fürnkranz.
An evaluation of efficient multilabel classification algorithms for large-scale problems in the legal domain.
In LWA 2007: Lernen - Wissen - Adaption, Workshop Proceedings, pages 126-132, 2007.

Eneldo Loza Mencía and Johannes Fürnkranz.
Efficient multilabel classification algorithms for large-scale problems in the legal domain.
In Proceedings of the Language Resources and Evaluation Conference (LREC) Workshop on Semantic Processing of Legal Texts, pages 23-32, Marrakech, Morocco, 2008.

Eneldo Loza Mencía and Johannes Fürnkranz.
Efficient pairwise multilabel classification for large-scale problems in the legal domain.
In Walter Daelemans, Bart Goethals, and Katharina Morik, editors, Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50-65, Antwerp, Belgium, 2008. Springer-Verlag.

Eneldo Loza Mencía, Sang-Hyeun Park, and Johannes Fürnkranz.
Advances in efficient pairwise multilabel classification.
Technical Report TUD-KE-2008-06, Technische Universität Darmstadt, Knowledge Engineering Group, 2008.

Eneldo Loza Mencía, Sang-Hyeun Park, and Johannes Fürnkranz.
Efficient voting prediction for pairwise multilabel classification.
In Proceedings of the 11th European Symposium on Artificial Neural Networks (ESANN-09). Springer, 2009.

Eneldo Loza Mencía and Johannes Fürnkranz.
Efficient multilabel classification algorithms for large-scale problems in the legal domain.
In Semantic Processing of Legal Texts. Springer. To appear.

(last updated 2009-07-14)