Provides classes for working with MEDLINE citations in XML format (in particular, for the TREC 2004 and 2005 genomics tracks). Both tracks used a 10-year subset of MEDLINE totaling 4,591,008 records (citations), commonly called the MEDLINE04 collection. These classes are designed to work with the XML-formatted version of the distribution, which comes in five different files.

Here are the two steps for preparing the collection for processing with Hadoop:

  1. Uncompress the XML files and put them in HDFS. Working with the uncompressed versions makes it possible to split processing across many mappers.
  2. Since many information retrieval algorithms require a sequential numbering of documents (here, MEDLINE citations), it is necessary to build a mapping between docids (i.e., PMIDs) and docnos (sequentially-numbered ints). The class NumberMedlineCitations accomplishes this. Here is a sample invocation:
     hadoop jar cloud9.jar edu.umd.cloud9.collection.medline.NumberMedlineCitations \
     /umd/collections/medline04.raw/ \
     /user/jimmylin/medline-docid-tmp \
     /user/jimmylin/docno.mapping 100
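
To make the docid-to-docno idea concrete, here is a minimal sketch in plain Java (no Hadoop). It is not the code of NumberMedlineCitations: the class name DocnoMappingSketch and its methods are hypothetical, and it assumes that docnos are assigned in ascending PMID order starting from 1, which the text above does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper (not part of Cloud9): maps PMIDs to sequential docnos. */
class DocnoMappingSketch {
    // PMIDs in ascending order with duplicates removed; index i holds docno i + 1.
    private final List<Integer> sortedPmids;

    DocnoMappingSketch(Collection<Integer> pmids) {
        // TreeSet sorts and de-duplicates in one step.
        this.sortedPmids = new ArrayList<>(new TreeSet<>(pmids));
    }

    /** Returns the 1-based docno for a PMID, or -1 if the PMID is unknown. */
    int getDocno(int pmid) {
        int idx = Collections.binarySearch(sortedPmids, pmid);
        return idx < 0 ? -1 : idx + 1;
    }

    /** Returns the PMID for a 1-based docno. */
    int getDocid(int docno) {
        return sortedPmids.get(docno - 1);
    }

    /** Number of distinct citations in the mapping. */
    int size() {
        return sortedPmids.size();
    }

    public static void main(String[] args) {
        // Made-up PMIDs for illustration; one is repeated on purpose.
        DocnoMappingSketch mapping = new DocnoMappingSketch(
                Arrays.asList(15000000, 10605436, 12345678, 10605436));
        for (int docno = 1; docno <= mapping.size(); docno++) {
            System.out.println(mapping.getDocid(docno) + "\t" + docno);
        }
    }
}
```

Keeping the PMIDs in a sorted list means the lookup in either direction needs no second table: docno to docid is an index access, and docid to docno is a binary search.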
    

After the corpus has been prepared, it is ready for processing with Hadoop. The class DemoCountMedlineCitations is a simple demo program that counts all documents in the collection. It provides a skeleton for MapReduce programs that process the collection. Here is a sample invocation:

 hadoop jar cloud9.jar edu.umd.cloud9.collection.medline.DemoCountMedlineCitations \
 /umd/collections/medline04.raw/ \
 /user/jimmylin/count-tmp \
 /user/jimmylin/docno.mapping 100

The output key-value pairs in this sample program are the docid-to-docno mappings.
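
As a rough illustration of what counting citations involves, the following plain-Java sketch (no Hadoop) scans an XML string for MedlineCitation records and collects each record's PMID. It is not the parsing code used by DemoCountMedlineCitations: the class CitationScanSketch is hypothetical, and it assumes that each record is wrapped in a MedlineCitation element whose first PMID element is the citation's own identifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Cloud9): pulls PMIDs out of MEDLINE XML. */
class CitationScanSketch {
    // Matches one <MedlineCitation ...>...</MedlineCitation> record. The [ >] after
    // the tag name keeps the enclosing <MedlineCitationSet> from matching.
    private static final Pattern CITATION =
            Pattern.compile("<MedlineCitation[ >].*?</MedlineCitation>", Pattern.DOTALL);
    private static final Pattern PMID = Pattern.compile("<PMID[^>]*>(\\d+)</PMID>");

    /** Returns the PMID of every citation record found, in document order. */
    static List<String> extractPmids(String xml) {
        List<String> pmids = new ArrayList<>();
        Matcher citation = CITATION.matcher(xml);
        while (citation.find()) {
            // Assumption: the first <PMID> inside a record is the citation's own.
            Matcher pmid = PMID.matcher(citation.group());
            if (pmid.find()) {
                pmids.add(pmid.group(1));
            }
        }
        return pmids;
    }

    public static void main(String[] args) {
        // Tiny made-up input; real records carry many more fields.
        String xml = "<MedlineCitationSet>\n"
                + "<MedlineCitation Owner=\"NLM\">\n<PMID>10605436</PMID>\n</MedlineCitation>\n"
                + "<MedlineCitation>\n<PMID>10605437</PMID>\n</MedlineCitation>\n"
                + "</MedlineCitationSet>";
        List<String> pmids = extractPmids(xml);
        System.out.println("citations: " + pmids.size());
    }
}
```

In a MapReduce setting the same per-record logic would run inside a mapper, with the count accumulated across all input splits.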