This entry contains the resources used in and resulting from:

Eneldo Loza Mencía, Gerard de Melo and Jinseok Nam, Medical Concept Embeddings via Labeled Background Corpora, in: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), 2016 

In recent years, we have seen an increasing amount of interest in low-dimensional vector representations of words. Among other things, these facilitate computing word similarity and relatedness scores. The most well-known example of algorithms to produce representations of this sort are the word2vec approaches. In this paper, we investigate a new model to induce such vector spaces for medical concepts, based on a joint objective that exploits not only word co-occurrences but also manually labeled documents, as available from sources such as PubMed. Our extensive experimental analysis shows that our embeddings lead to significantly higher correlations with human similarity and relatedness assessments than previous work. Due to the simplicity and versatility of vector representations, these findings suggest that our resource can easily be used as a drop-in replacement to improve any systems relying on medical concept similarity measures.

The vector representations

    BioASQ_train_full_no_desc.vectors: the label, word, and document embeddings, stored as Python objects
    MeSH_name_id_mapping_2015.txt: mapping between MeSH concept names and MeSH IDs
    seen_label_vocabulary.txt: list of medical concepts (labels) for which embeddings exist, ordered by number of occurrences
    word_vocabulary.txt: list of words for which embeddings exist, ordered by number of occurrences

Example Code

You will find a full example script for evaluating the similarity between pairs of medical concepts in ComputeEmbeddingsSimilaritiesForPairs.zip, together with the necessary text files. Copy the embeddings vector file to the expected location (data/medical_aitext). The main steps are the following.

Load the embeddings:
description_model = load_model('BioASQ_train_full_no_desc.vectors')
To look up an embedding vector, find the label index (see seen_label_vocabulary.txt) and index into the label embeddings. Word embeddings are retrieved in the same way.
emb=description_model['label_emb'][label_index]
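The look-up can be sketched as follows. This is a minimal illustration, not the code from the example script. It assumes that seen_label_vocabulary.txt lists one concept per line and that the zero-based line position is the label index; both assumptions should be checked against the actual file. The helper name build_index, the toy vocabulary, and the toy model dictionary are hypothetical.

```python
def build_index(vocab_lines):
    """Map each vocabulary entry to its line position (assumed zero-based)."""
    index = {}
    for position, line in enumerate(vocab_lines):
        entry = line.strip()
        # Skip empty lines and keep the first position if an entry repeats.
        if entry and entry not in index:
            index[entry] = position
    return index


# Toy stand-ins for the vocabulary file and the loaded model (hypothetical data).
vocab_lines = ["Humans\n", "Female\n", "Male\n"]
toy_model = {"label_emb": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]}

label_to_index = build_index(vocab_lines)
label_index = label_to_index["Female"]
emb = toy_model["label_emb"][label_index]
```

With the real data, vocab_lines would be the lines read from seen_label_vocabulary.txt and toy_model would be replaced by the object returned by load_model.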
When you have embeddings for two concepts, you can obtain their similarity by computing the cosine similarity, i.e. the inner product of the length-normalized vectors. Do not mix up label and word embeddings.
from scipy import spatial
sim_desc = ((1 - spatial.distance.cosine(left_emb, right_emb)) + 1) / 2  # maps [-1, 1] to [0, 1]
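To make the computation explicit, here is a dependency-free sketch of the same formula. scipy's spatial.distance.cosine returns the cosine distance (one minus the cosine similarity), which is why the expression above subtracts it from 1. The cosine similarity lies in [-1, 1], so adding 1 and halving maps it to [0, 1]. The function names and the toy vectors are illustrative and not part of the example script.

```python
import math


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def scaled_similarity(u, v):
    """Map the cosine similarity from [-1, 1] to [0, 1]."""
    return (cosine_similarity(u, v) + 1) / 2


# Toy vectors (hypothetical): same direction, orthogonal, opposite direction.
same = scaled_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = scaled_similarity([1.0, 0.0], [0.0, 1.0])
opposite = scaled_similarity([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0.5, and opposite vectors score 0.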
You can compute the Spearman correlation by using the built-in functions from scipy:
from scipy.stats import spearmanr
sim_rho, _ = spearmanr(target_sim_scores, label_emb_sim_scores)
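For intuition, the Spearman coefficient is the Pearson correlation of the ranks of the two score lists. The sketch below implements that definition in plain Python, using average ranks for ties. It is meant as an illustration only; scipy's spearmanr remains the practical choice. All function names and the toy scores are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores (hypothetical): human ratings versus embedding similarities.
human_scores = [1.0, 2.5, 3.0, 4.0]
model_scores = [0.2, 0.4, 0.6, 0.9]
rho = spearman(human_scores, model_scores)
```

Because only the ordering matters, any monotone rescaling of the similarity scores, such as the mapping to [0, 1], leaves the coefficient unchanged.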

Software

The embeddings were learned with the AiTextML software written by Jinseok Nam; see also the corresponding publication. The source code and installation instructions are available at the project site on GitHub.

Other Resources

Assessed Pairs of Medical Concepts

Around 500-600 pairs of medical concepts were assessed by human experts regarding their similarity (UMNSRS_similarity.csv) and relatedness (UMNSRS_relatedness.csv) and made available as the Medical Residents Similarity and Relatedness Set datasets. In addition, the Medical Coders Set (MayoSRS.terms) provides 101 pairs. All datasets were made available by the University of Minnesota.

Embeddings trained from PubMed

We used pretrained word embeddings, trained on abstracts and full documents from PubMed and on Wikipedia, provided by the Natural Language Processing Laboratory.

Web-Interface for computing path-based proximity

A web-interface to the UMLS::Similarity software package for obtaining similarity and relatedness measures between biomedical terms is available.

UMLS Ontology

The Unified Medical Language System ontology is available through a web interface or as a download from the web site. However, you will need a (usually free) account. The UMLS ontology also includes a mapping to the MeSH ontology.

MeSH 2015 ontology

The concepts used are from the Medical Subject Headings ontology. You can download the descriptors from http://www.nlm.nih.gov/mesh/filelist.html. Please note that we used the 2015 version of MeSH in our experiments.

BioASQ background corpus

The BioASQ dataset is a subset of the PubMed database of biomedical publications and can be downloaded from the competition site (Task 3a) after registration.

Terms of Use

The data provided by the authors on this site is freely available. For external software (including AiTextML) or data that may be included in the distributables, such as libraries or datasets, please contact the original authors for their terms of use. Nevertheless, we would be glad if you cited this site or our paper when using the provided software or data.