Provides classes for working with the TREC collection (particularly disks 4 and 5). TREC disks 4 and 5 constitute one of the standard test collections used in information retrieval research. There are two common "views" of the collection.

Here are the two steps for preparing the collection for processing with Hadoop:

  1. The distribution of the collection consists of many individual small files (listed above). Since Hadoop works better with large files, it is advisable to concatenate the individual files together (e.g., with a simple Perl script).
  2. Since many information retrieval algorithms require a sequential numbering of documents, it is necessary to build a mapping between docids (e.g., LA123190-0134) and docnos (sequentially numbered ints). The class NumberTrecDocuments accomplishes this. Here is a sample invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.NumberTrecDocuments \
/umd/collections/trec/trec4-5_noCRFR.xml \
/user/jimmylin/trec-docid-tmp \
/user/jimmylin/docno.mapping 100

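The mapping built in step 2 is conceptually simple: collect all docids, sort them so the assignment is deterministic, and hand out consecutive integers starting at 1. The following stand-alone sketch in plain Java illustrates the idea (the docids are invented for illustration; NumberTrecDocuments does the same thing at scale as a MapReduce job):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DocnoMappingSketch {
    // Assign sequential docnos (1, 2, 3, ...) to docids in sorted order.
    public static Map<String, Integer> buildMapping(List<String> docids) {
        List<String> sorted = new ArrayList<>(docids);
        Collections.sort(sorted);
        Map<String, Integer> mapping = new LinkedHashMap<>();
        int docno = 1;
        for (String docid : sorted) {
            mapping.put(docid, docno++);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Hypothetical docids, in the style of TREC disks 4 and 5.
        Map<String, Integer> mapping = buildMapping(List.of(
            "LA010189-0001", "FT911-3", "FBIS3-1", "LA123190-0134"));
        mapping.forEach((docid, docno) -> System.out.println(docid + "\t" + docno));
    }
}
```

Because the assignment depends only on the sorted order of docids, the mapping is stable across runs, which is what allows it to be written once (the docno.mapping file above) and reused by later jobs.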
After the corpus has been prepared, it is ready for processing with Hadoop. The class DemoCountTrecDocuments is a simple demo program that counts all documents in the collection. It provides a skeleton for MapReduce programs that process the collection. Here is a sample invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.DemoCountTrecDocuments \
/umd/collections/trec/trec4-5_noCRFR.xml \
/user/jimmylin/count-tmp \
/user/jimmylin/docno.mapping 100

The output key-value pairs of this sample program are the docid-to-docno mappings.
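Counting documents amounts to detecting <DOC>...</DOC> boundaries in the concatenated SGML. The core idea can be sketched outside Hadoop in a few lines of plain Java (the sample text here is invented; the actual demo does this per input split, using a mapper and a counter):

```java
public class CountDocsSketch {
    // Count TREC documents by counting <DOC> open tags.
    // Note: "<DOCNO>" does not match "<DOC>" (the fifth character differs),
    // so nested tags are not double-counted.
    public static int countDocs(String sgml) {
        int count = 0;
        int pos = 0;
        while ((pos = sgml.indexOf("<DOC>", pos)) != -1) {
            count++;
            pos += "<DOC>".length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Invented two-document sample in TREC SGML style.
        String sample =
            "<DOC>\n<DOCNO> LA010189-0001 </DOCNO>\n<TEXT>sample</TEXT>\n</DOC>\n" +
            "<DOC>\n<DOCNO> FT911-3 </DOCNO>\n<TEXT>sample</TEXT>\n</DOC>\n";
        System.out.println("docs = " + countDocs(sample));
    }
}
```

In the real MapReduce program, each mapper applies this kind of boundary detection to its split and increments a counter; the framework sums the counters across all mappers to produce the collection-wide document count.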