Medical information retrieval test collection

Background

This page provides a test collection, with associated queries and relevance judgements, for evaluating information retrieval of medical records. Details of the design of the evaluation framework are provided in the following paper:
Koopman, B., Bruza, P., Sitbon, L., Lawley, M. Evaluation of medical information retrieval. Poster proceedings of the 34th annual international ACM SIGIR conference on Research and development in information retrieval, 2011.
If you make use of this collection, please cite the above publication.

Test corpus

As our test corpus we use the BLULab NLP repository, a collection of 81,617 de-identified clinical records gathered from multiple U.S. hospitals during 2007. The collection is available to the community for research purposes. It contains a number of different medical record types, including History and Physical Exams, Progress Notes, Consultation Reports, Radiology Reports, Emergency Department Reports, Discharge Summaries, Operative Reports and Cardiology Reports.

You will need to apply separately for access to the BLULab corpus.

Queries & Relevance Judgements

Below is a sample of the queries and relevance judgements for the BLULab collection. The complete data will be made available shortly.

Queries are provided in two formats: Indri-style query XML documents and TREC-style topics.

Description | Document List | Queries (Indri) | Queries (TREC) | Relevance Judgements | Evaluation Run
--- | --- | --- | --- | --- | ---
Samples | sample.doclist | sample.iq | sample.topics | sample.iq.qrel | sample.iq.eval
All | all.doclist | all.iq | all.topics | all.iq.qrel | all.iq.eval
Discharge Summaries, History & Physical Exams, Emergency Department Reports. No laboratory based reports. | ds_er_hp.doclist | ds_er_hp.iq | ds_er_hp.topics | ds_er_hp.iq.qrel | ds_er_hp.iq.eval
Discharge Summaries | ds.doclist | ds.iq | ds.topics | ds.iq.qrel | ds.iq.eval
Discharge Summaries (excluding administrative non-clinical codes) | ds.doclist | ds_clinical.iq | ds_clinical.topics | ds_clinical.iq.qrel | ds_clinical.iq.eval
Discharge Summaries mapped to SNOMED CT concept descriptions | ds.doclist | snomedized-ds.iq | snomedized-ds.topics | snomedized-ds.iq.qrel | snomedized-ds.iq.eval
Download all files: med_eval_all.zip (9.6M)
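
As a rough illustration of how the Indri-style query files might be read, the sketch below parses a query document into a dictionary of query number to query text. It assumes the .iq files follow the usual IndriRunQuery layout of <query> elements containing <number> and <text>; the sample queries and the function name read_indri_queries are made up for the example and are not taken from the collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the usual IndriRunQuery layout.
# The real .iq files in this collection may be structured differently.
SAMPLE_IQ = """<parameters>
  <query>
    <number>1</number>
    <text>#combine(chest pain)</text>
  </query>
  <query>
    <number>2</number>
    <text>#combine(hip fracture)</text>
  </query>
</parameters>"""


def read_indri_queries(xml_text):
    """Return {query number: query text} from an Indri query document."""
    root = ET.fromstring(xml_text)
    queries = {}
    for query in root.iter("query"):
        number = query.findtext("number", default="").strip()
        text = query.findtext("text", default="").strip()
        if number:  # ignore <query> elements without a number
            queries[number] = text
    return queries


queries = read_indri_queries(SAMPLE_IQ)
for number, text in queries.items():
    print(number, text)
```

Query numbers are kept as strings so that they can be matched directly against the identifiers in the relevance judgement files.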

Other resources

Indri parameters

Indri retrieval parameter configuration:
	<parameters>
	  <index>/home/bevan/data/blulab/indri-index</index> 
	  <count>1500</count> 
	  <trecFormat>true</trecFormat> 
	  <baseline>tfidf</baseline>
	  <stemmer>
	    <name>porter</name>
	  </stemmer>
	</parameters>
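
With trecFormat set to true, Indri writes one result per line in the standard TREC run layout (query id, Q0, document id, rank, score, run tag). The following is a minimal sketch of scoring such a run against relevance judgements with precision at k. It assumes the .qrel files use the common four-column TREC qrels layout (query id, iteration, document id, relevance); the document ids and scores are invented, and this is not the procedure used to produce the .eval files above.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines: 'qid iteration docno relevance'."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docno, judgement = parts
        if int(judgement) > 0:
            relevant[qid].add(docno)
    return relevant


def parse_run(lines):
    """Parse TREC-format run lines: 'qid Q0 docno rank score run_tag'."""
    by_query = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        qid, _q0, docno, rank, _score, _tag = parts
        by_query[qid].append((int(rank), docno))
    # Order each query's documents by ascending rank.
    return {qid: [docno for _rank, docno in sorted(items)]
            for qid, items in by_query.items()}


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / k


# Invented data for illustration only.
QRELS = ["1 0 docA 1", "1 0 docB 0", "1 0 docC 1"]
RUN = [
    "1 Q0 docA 1 12.5 indri",
    "1 Q0 docB 2 9.1 indri",
    "1 Q0 docC 3 7.3 indri",
]

qrels = parse_qrels(QRELS)
run = parse_run(RUN)
for qid, ranking in run.items():
    print(qid, precision_at_k(ranking, qrels[qid], 2))
```

For reported results, an established tool such as trec_eval is the safer choice; the sketch is only meant to show how the run and judgement files relate to each other.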

ICD-9 codes

ICD-9 is published as an RTF document and is therefore not distributed in a machine-readable form. For this reason we provide a CSV file of ICD codes and their corresponding descriptions:
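
A CSV file of this kind could be loaded into a lookup table along the following lines. This is only a sketch: it assumes two columns (code, then description) with no header row, which may not match the actual file, and load_icd_descriptions is a name chosen for the example.

```python
import csv
import io

# Assumed layout: code,description with no header row.
# The two rows below are example ICD-9-CM entries, not an excerpt of the file.
SAMPLE_CSV = '''428.0,"Congestive heart failure, unspecified"
486,"Pneumonia, organism unspecified"
'''


def load_icd_descriptions(handle):
    """Return {ICD code: description} from an open CSV file handle."""
    descriptions = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        code, description = row[0].strip(), row[1].strip()
        descriptions[code] = description
    return descriptions


icd = load_icd_descriptions(io.StringIO(SAMPLE_CSV))
print(icd["486"])
```

Codes are kept as strings rather than converted to numbers, since a numeric conversion would drop significant trailing zeros in codes such as 250.00. For a file on disk, pass open(path, newline="") instead of the StringIO object.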

ICD-9 SNOMED CT mapping

We used the National Library of Medicine's UMLSKS web service to map ICD codes to SNOMED CT concepts via UMLS. A copy of the resulting mapping file is provided: