The EUR-Lex text collection is a collection of documents about European Union law. It contains several different types of documents, including treaties, legislation, case-law and legislative proposals, which are indexed according to several orthogonal categorization schemes to allow for multiple search facilities. The most important categorization is provided by the EUROVOC descriptors, a topic hierarchy with almost 4,000 categories regarding different aspects of European law.
This document collection provides an excellent opportunity to study text classification techniques for several reasons:
The database is a very challenging multilabel scenario due to the high number of possible labels (up to 4,000). A first step towards analyzing this database was taken by applying multilabel classification techniques to three of its categorization schemes in the following work:
Eneldo Loza Mencía and Johannes Fürnkranz. Efficient multilabel classification algorithms for large-scale problems in the legal domain. In Semantic Processing of Legal Texts. Springer. To appear.
Previous shorter versions of this work used a version of the dataset with slightly more documents [2,3].
If you just want to download the dataset, go straight ahead to the files section.
The EUR-Lex/CELEX (Communitatis Europeae LEX) site provides a freely accessible repository for European Union law texts. The documents include the Official Journal of the European Union, treaties, international agreements, legislation in force, legislation in preparation, case-law and parliamentary questions. They are available in most of the languages of the EU and in HTML and PDF format. We retrieved the HTML versions with bibliographic notes recursively from all (non-empty) documents in the English version of the Directory of Community legislation in force, in total 19,348 documents. Only documents related to secondary law (in contrast to primary law, the constitutional treaties of the European Union) and international agreements are included in this repository. The legal forms of the included acts are mostly decisions (8,917 documents), regulations (5,706), directives (1,898) and agreements (1,597).
The bibliographic notes of the documents contain information such as dates of effect and validity, authors, relationships to other documents, and classifications. The classifications include the assignment to several EUROVOC descriptors, directory codes and subject matters; hence all classifications are multilabel. EUROVOC is a multilingual thesaurus providing a controlled vocabulary for European Institutions; documents in the documentation systems of the EU are indexed using this thesaurus. The directory codes are classes of the official classification hierarchy of the Directory of Community legislation in force. It contains 20 chapter headings with up to four sub-division levels.
A total of 3,956 different EUROVOC descriptors were identified in the retrieved documents; each document is associated with 5.31 descriptors on average. In contrast, there are only 201 different subject matters appearing in the dataset, with a mean of 2.21 labels per document, and 410 different directory codes, with an average label set size of 1.29. Note that for the directory codes we used only the assignment to the leaf category, as the parent nodes can be deduced from the leaf node assignment.
A total of 19,940 documents was retrieved from the EUR-Lex/CELEX (Communitatis Europeae LEX) site on the 7th of July, 2006. The list of retrieved documents is contained in the download script eurlex_download_EN_NOT.sh.gz (change the extension to ".bat" for Windows). The retrieved HTML files are contained in the archive eurlex_html_EN_NOT.zip.
The retrieved documents were parsed in order to extract the category assignments and the body text. An internal ID was given to each document retrieved in this way, starting from one and enumerating the documents according to the alphabetic ordering of their file names.
A range of documents was excluded: 214 contained an error message, e.g. that the document was not available in English (189 cases), 316 contained an empty text field, 50 did not contain any relevant category information, and 12 contained corrigenda which mostly referred to non-English translations (all except one document) and were hence not available in English. Hence, 19,940 - 214 - 316 - 50 - 12 = 19,348 documents remain.
A subset used in early experiments still contained 19,596 documents [2,3]; 248 documents were additionally removed after a review of the dataset [5,6]. Removed documents still obtained an ID, so the numbering of the document IDs in the dataset is not continuous.
A mapping of document ID to filename and CELEX ID, together with a column indicating whether the document was removed, is available in eurlex_ID_mappings.csv.gz.
An explanation of the CELEX IDs and further useful information can be found at http://eur-lex.europa.eu/en/legis/avis.htm.
The log file of the extraction and selection process is named eurlex_processing.log.gz. Note that two documents (no. 244 and 2736) were excluded manually.
For the following steps the library of the Apache Lucene Project (version 2.1.0) was used: the text was transformed to lower case (LowerCaseTokenizer), stop words from a standard English stop word list were removed (StopFilter), and finally the Porter stemming algorithm was applied (PorterStemFilter).
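The extraction code itself is not part of this distribution, but the analysis chain described above can be set up in a few lines against the old Lucene 2.x TokenStream API. The following sketch is only illustrative: the class name, the preprocess method and the assumption that the stop word list (e.g. english.stop) has already been loaded into the stopWords array are ours, not part of the original pipeline.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class PreprocessSketch {

    // Tokenize, lower-case, remove stop words and stem one document text.
    // stopWords is assumed to have been loaded from a stop word list such as english.stop.
    static List<String> preprocess(String text, String[] stopWords) throws IOException {
        TokenStream stream = new LowerCaseTokenizer(new StringReader(text)); // split at non-letters, lower-case
        stream = new StopFilter(stream, stopWords);                          // drop stop words
        stream = new PorterStemFilter(stream);                               // Porter stemming
        List<String> tokens = new ArrayList<String>();
        for (Token t = stream.next(); t != null; t = stream.next()) {
            tokens.add(t.termText());                                        // old (pre-2.9) token API
        }
        stream.close();
        return tokens;
    }
}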
The file eurlex_tokenstring.arff.gz contains the token strings in the ARFF format of Weka. The first attribute is the ID and the second is of type String and contains the converted list of tokens.
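For orientation, such a token string file has roughly the following layout; the relation and attribute names as well as the token excerpt shown here are made up for illustration and are not copied from the actual file.

@relation eurlex_tokenstring
@attribute document_id numeric
@attribute text string
@data
20,'council regul ec oliv oil ...'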
The class assignments are saved in the TREC qrels format, as is done for the Reuters RCV1v2 collection. Each line encodes a document-class assignment in the format "<class_identifier> <document_ID> 1". As an example, the following listing shows document number 13286 belonging to the classes "Oils and Fats" and "Commercial Policy", and document 13287 assigned to "Agriculture" and again to "Oils and Fats".
oils_and_fats 13286 1
commercial_policy 13286 1
agriculture 13287 1
oils_and_fats 13287 1
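If the assignments are needed in memory, such a qrels file can be read with a few lines of Java. The following sketch is not part of the distribution and all names in it are made up for illustration; the main method simply reports the number of labeled documents and the average label set size.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class QrelsReader {

    // Reads a qrels file into a map from document ID to its set of class labels.
    static Map<Integer, Set<String>> read(String path) throws IOException {
        Map<Integer, Set<String>> labels = new HashMap<Integer, Set<String>>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.trim().split("\\s+"); // <class_identifier> <document_ID> 1
            if (fields.length < 3) continue;             // skip malformed lines
            int docId = Integer.parseInt(fields[1]);
            Set<String> set = labels.get(docId);
            if (set == null) {
                set = new HashSet<String>();
                labels.put(docId, set);
            }
            set.add(fields[0]);
        }
        in.close();
        return labels;
    }

    // Prints the number of labeled documents and the average label set size.
    public static void main(String[] args) throws IOException {
        Map<Integer, Set<String>> labels = read(args[0]);
        int total = 0;
        for (Set<String> s : labels.values()) {
            total += s.size();
        }
        System.out.println(labels.size() + " documents, "
                + (double) total / labels.size() + " labels per document on average");
    }
}

Run on the EUROVOC qrels file, this should approximately reproduce the average of 5.31 descriptors per document reported above; small deviations are possible because the qrels files may also contain documents that were later excluded (see the note below).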
In total, three categorization types were extracted: the subject matter, directory code and EUROVOC descriptor assignment. Directory code classes are organized in a hierarchy of four levels. When a document is assigned to a class, it is also linked to all the parent classes. E.g. document 13286 belongs to "Oils and fats" (code 03.60.59.00), and therefore also to "Products subject to market organisation" (03.60.00.00) and "Agriculture" (03.00.00.00). A ".00" in the numeric code denotes the root node of this level, i.e. the parent node. Since the parent nodes can be deduced from the child nodes, we commonly only take into consideration the deepest class. The class assignments are contained in eurlex_id2class.zip in the following files:
Note that these files may contain mappings to documents that were excluded from the final document set due to empty or incomplete text fields. This has to be taken into account, especially when computing statistics.
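Deducing the ancestor classes from a leaf directory code is a small string manipulation exercise. The following sketch is again only an illustration (class and method names are ours): it zeroes out the trailing levels of the code one at a time, so that parents("03.60.59.00") yields 03.60.00.00 and 03.00.00.00, matching the example above.

import java.util.ArrayList;
import java.util.List;

public class DirectoryCodeParents {

    // Returns the parent codes of a directory code such as "03.60.59.00",
    // from the immediate parent up to the chapter heading.
    static List<String> parents(String code) {
        String[] parts = code.split("\\.");
        List<String> result = new ArrayList<String>();
        // find the deepest level that is actually used (not "00")
        int depth = parts.length;
        while (depth > 0 && parts[depth - 1].equals("00")) {
            depth--;
        }
        // zero out one level at a time to obtain the ancestors
        for (int level = depth - 1; level >= 1; level--) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < parts.length; i++) {
                if (i > 0) sb.append('.');
                sb.append(i < level ? parts[i] : "00");
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [03.60.00.00, 03.00.00.00]
        System.out.println(parents("03.60.59.00"));
    }
}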
The instances of eurlex_tokenstring.arff.gz were split randomly into 10 folds in order to perform Cross Validation. The resulting 10 train/test splits are contained in eurlex_tokenstring_CV10.zip.
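The exact randomization used to create these splits is not documented here, so for comparability you should work with the provided archive rather than re-split the data. Purely as an illustration of such a random fold assignment (all names and the seed are arbitrary), one could proceed as follows:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class FoldAssignmentSketch {

    // Assigns each of n instances to one of numFolds folds at random.
    static int[] assignFolds(int n, int numFolds, long seed) {
        List<Integer> order = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed)); // random permutation of the instance indices
        int[] fold = new int[n];
        for (int pos = 0; pos < n; pos++) {
            fold[order.get(pos)] = pos % numFolds;    // deal the shuffled instances out round-robin
        }
        return fold;
    }
}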
Each train/test set pair was converted separately into the TF-IDF feature representation. In order to ensure that no information from the test set leaks into the training data, the IDF statistics were computed on the training sets only. The conversion was done with the StringToWordVector class of Weka 3.5.5. Afterwards, the attributes were ordered according to their document frequency. The following commands were used:
for (( i=1; $i <= 10; i++ ))
do
# convert the token strings into TF-IDF vectors; batch mode (-b) initializes the filter (incl. IDF statistics) on the training set and applies it to the test set
java -Xmx1000M -cp weka.jar weka.filters.unsupervised.attribute.StringToWordVector -R 2 -W 999999 -C -I -N 0 -i eurlex_tokenstring_CV${i}-10_train.arff -o temp_eurlex_CV${i}-10_train.arff -b -r eurlex_tokenstring_CV${i}-10_test.arff -s temp_eurlex_CV${i}-10_test.arff
# rank and reorder the attributes by their document frequency using the evaluator described below
java -Xmx1000M -cp weka.jar:. weka.filters.supervised.attribute.AttributeSelection -E DocumentFrequencyAttributeEval -S "weka.attributeSelection.Ranker -T 0.0 -N -1" -i temp_eurlex_CV${i}-10_train.arff -o eurlex_CV${i}-10_train.arff -b -r temp_eurlex_CV${i}-10_test.arff -s eurlex_CV${i}-10_test.arff
done
We wrote an attribute evaluation class for Weka's AttributeSelection in order to sort the attributes by their document frequency. The source code is contained in DocumentFrequencyAttributeEval.java, the class file in DocumentFrequencyAttributeEval.class. The resulting data files are contained in eurlex_CV10.zip.
We used feature selection in our experiments in order to reduce the computational costs, especially the memory consumption. A version of eurlex_CV10.zip with only the first 5000 features with respect to their document frequency is contained in eurlex_nA-5k_CV10.zip and was obtained by executing the following script:
for (( i=1; $i <= 10; i++ ))
do
# -V inverts the selection, i.e. the first 5001 attributes are kept and all remaining ones removed
java -Xmx1000M -cp weka.jar weka.filters.unsupervised.attribute.Remove -V -R first-5001 -i eurlex_CV${i}-10_train.arff -o eurlex_nA-5k_CV${i}-10_train.arff
java -Xmx1000M -cp weka.jar weka.filters.unsupervised.attribute.Remove -V -R first-5001 -i eurlex_CV${i}-10_test.arff -o eurlex_nA-5k_CV${i}-10_test.arff
done
Please refer to this dataset if you would like to compare directly with our results.
Mulan is a library for multilabel classification developed by the Machine Learning and Knowledge Discovery Group at the Aristotle University of Thessaloniki. It uses a different data format, which is also based on ARFF: the label information is coded as binary attributes at the end of the attribute list. We do not provide the data directly in this format since this would result in too many different combinations. Instead, we provide a small Java (1.5) program that converts the ARFF files of this dataset into the Mulan format. The command line of the program (Convert2MulanArff.java, Convert2MulanArff.class) is as follows:
java -mx1000M -cp weka.jar:. Convert2MulanArff inputMLdata.arff id2class_mapping.qrels outputMLdata.arff
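For orientation, a converted file then consists of the feature attributes followed by one binary attribute per class, roughly as sketched below; this excerpt is purely illustrative (placeholder feature names, labels taken from the qrels example above), and the actual converter output may differ in detail, e.g. in how the binary attributes are declared.

@attribute <feature_1> numeric
...
@attribute <feature_m> numeric
@attribute oils_and_fats {0,1}
@attribute commercial_policy {0,1}
...
@data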
File | Size | Description |
eurlex_download_EN_NOT.sh.gz | 117K | Download script for the source documents of the EUR-Lex dataset. |
eurlex_html_EN_NOT.zip | 161M | Source documents in HTML format. |
english.stop | 3.6K | Standard English stop word list. |
eurlex_processing.log.gz | 1.3M | Log file of processing of the source documents. |
eurlex_ID_mappings.csv.gz | 157K | Table of the mapping between file name, CELEX ID, internal document ID and whether a document was excluded. |
eurlex_id2class.zip | 1.6M | Mappings of documents to the different categorizations. You will need this for experimentation. |
eurlex_tokenstring.arff.gz | 36M | Contains the preprocessed documents (basically stemming) in the ARFF token list format. Use this for further preprocessing, e.g. for different cross validation splits. |
eurlex_tokenstring_CV10.zip | 382M | Ten fold cross validation training and test splits used in the experiments in token list format. Use this e.g. for different term weighting or feature vector representation computations. |
DocumentFrequencyAttributeEval.class | 4.3K | Program that orders attributes according to their document frequency. |
DocumentFrequencyAttributeEval.java | 7.3K | Source code of DocumentFrequencyAttributeEval.class. |
eurlex_CV10.zip | 269M | Cross validation splits of TF-IDF representation of the documents. Use this e.g. for a different feature selection. |
eurlex_nA-5k_CV10.zip | 218M | Cross validation splits of TF-IDF representation of the documents with the first 5000 most frequent features selected, as used in the experiments. Use this for a direct comparison. |
Convert2MulanArff.class | 5.3K | Converts the indexed Arff format into the Mulan Arff format. |
Convert2MulanArff.java | 4.6K | Source code of Convert2MulanArff.class. |
Concerning the original EUR-Lex documents and all direct derivatives containing text or other information from these documents, the data is as freely available as determined by the copyright notice of the EUR-Lex site. Additional data provided by the authors on this site is freely available. Nevertheless, we would be glad if you cited this site or [6] when you use the EUR-Lex dataset in some way. We would also be pleased to list your work in the references section.
The EUR-Lex dataset was analyzed and used in the following publications, partially using a previous, slightly different version. Download links can be found on the publications site of our group.