Provides classes for working with MEDLINE citations in XML format (in particular, for the TREC 2004 and 2005 genomics tracks). Both tracks used a 10-year subset of MEDLINE totaling 4,591,008 records (citations); this subset is commonly called the MEDLINE04 collection. These classes are designed to work with the XML-formatted version of the distribution, which comes in five different files.
Here are the two steps for preparing the collection for processing with Hadoop:
NumberMedlineCitations
accomplishes this. Here is a sample invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.medline.NumberMedlineCitations \
  /umd/collections/medline04.raw/ \
  /user/jimmylin/medline-docid-tmp \
  /user/jimmylin/docno.mapping 100
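The docno.mapping argument in the invocation above suggests a file that maps each citation's docid to a docno. As a rough illustration of what such a mapping could look like, here is a plain-Java sketch with no Hadoop dependency. It is not the NumberMedlineCitations implementation: it assumes docids are numeric PMIDs and that docnos are consecutive integers starting at 1, assigned in ascending docid order. The class name DocnoMappingSketch and the method numberDocids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class DocnoMappingSketch {

    // Hypothetical helper: assigns docnos 1..N to the distinct docids,
    // ordered by numeric value (an assumption, not the documented behavior).
    static Map<String, Integer> numberDocids(List<String> docids) {
        // Drop duplicate docids while copying the input.
        List<String> sorted = new ArrayList<>(new LinkedHashSet<>(docids));
        sorted.sort(Comparator.comparingLong(Long::parseLong));

        // LinkedHashMap keeps the entries in docno order when iterating.
        Map<String, Integer> mapping = new LinkedHashMap<>();
        int docno = 1;
        for (String docid : sorted) {
            mapping.put(docid, docno++);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Made-up docids, used only for illustration.
        Map<String, Integer> mapping =
                numberDocids(Arrays.asList("15242537", "9876543", "10000001"));
        // Print one tab-separated docid/docno pair per line.
        mapping.forEach((docid, docno) -> System.out.println(docid + "\t" + docno));
    }
}
```

Dense, consecutive docnos are convenient because they can index directly into arrays, which is one common reason for numbering a collection before processing it.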
After the corpus has been prepared, it is ready for processing with
Hadoop. The
class DemoCountMedlineCitations
is a simple demo program that counts all documents in the collection.
It provides a skeleton for MapReduce programs that process the
collection. Here is a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.medline.DemoCountMedlineCitations \
  /umd/collections/medline04.raw/ \
  /user/jimmylin/count-tmp \
  /user/jimmylin/docno.mapping 100
The output key-value pairs of this sample program are the docid-to-docno mappings.
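To make the counting step concrete without a cluster, the sketch below counts citations in a small XML string using the JDK's streaming XML parser. It is only a standalone illustration, not the DemoCountMedlineCitations code: it assumes each record is a MedlineCitation element containing a PMID, as in the MEDLINE XML format, and the sample input is made up. The class name CountCitationsSketch and the method countCitations are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CountCitationsSketch {

    // Hypothetical helper: counts <MedlineCitation> start tags in the input.
    static int countCitations(String xml) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Each start tag named MedlineCitation marks one record.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "MedlineCitation".equals(reader.getLocalName())) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Malformed input is reported as an unchecked exception.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up two-record sample, for illustration only.
        String xml = "<MedlineCitationSet>"
                + "<MedlineCitation><PMID>111</PMID></MedlineCitation>"
                + "<MedlineCitation><PMID>222</PMID></MedlineCitation>"
                + "</MedlineCitationSet>";
        System.out.println(countCitations(xml)); // prints 2
    }
}
```

In a MapReduce setting the same per-record check would typically live in the map function, with the framework aggregating the counts across input splits.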