Provides classes for working with the TREC collection (particularly disks 4 and 5). TREC disks 4 and 5 represent one of the standard collections used in information retrieval research. There are two common "views" of the collection:
Here are the two steps for preparing the collection for processing with Hadoop:
Build a mapping between docids (e.g., LA123190-0134) and docnos (sequentially-numbered ints). The class NumberTrecDocuments accomplishes this. Here is a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.NumberTrecDocuments \
  /umd/collections/trec/trec4-5_noCRFR.xml \
  /user/jimmylin/trec-docid-tmp \
  /user/jimmylin/docno.mapping 100
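To make the idea of a docid-to-docno mapping concrete, here is a minimal sketch in plain Java. It is not the Cloud9 implementation: the class name DocnoMappingSketch, the choice to sort docids lexicographically, and the 1-based numbering are all assumptions made for illustration, and the two docids other than LA123190-0134 are illustrative values.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical illustration of a docid <-> docno mapping; not Cloud9 code.
final class DocnoMappingSketch {
    // Sorted, de-duplicated docids. Assumption: docids[i] gets docno i + 1.
    private final String[] docids;

    private DocnoMappingSketch(String[] sortedDocids) {
        this.docids = sortedDocids;
    }

    // Builds the mapping by sorting the distinct docids.
    static DocnoMappingSketch build(Collection<String> ids) {
        String[] sorted = ids.stream().distinct().sorted().toArray(String[]::new);
        return new DocnoMappingSketch(sorted);
    }

    // Returns the docno for a docid, or -1 if the docid is unknown.
    int getDocno(String docid) {
        int i = Arrays.binarySearch(docids, docid);
        return i >= 0 ? i + 1 : -1;
    }

    // Returns the docid for a docno, or null if the docno is out of range.
    String getDocid(int docno) {
        return (docno >= 1 && docno <= docids.length) ? docids[docno - 1] : null;
    }

    int size() {
        return docids.length;
    }

    public static void main(String[] args) {
        DocnoMappingSketch m =
            build(Arrays.asList("LA123190-0134", "FBIS3-1", "FT911-3"));
        // Print one "docid<TAB>docno" pair per line.
        for (int docno = 1; docno <= m.size(); docno++) {
            System.out.println(m.getDocid(docno) + "\t" + docno);
        }
    }
}
```

Keeping the docids in a sorted array lets a lookup in either direction work without a second data structure: docid to docno is a binary search, and docno to docid is an array index.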
After the corpus has been prepared, it is ready for processing with Hadoop. The class DemoCountTrecDocuments is a simple demo program that counts all documents in the collection. It provides a skeleton for MapReduce programs that process the collection. Here is a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.DemoCountTrecDocuments \
  /umd/collections/trec/trec4-5_noCRFR.xml \
  /user/jimmylin/count-tmp \
  /user/jimmylin/docno.mapping 100
The output key-value pairs in this sample program are the docid to docno mappings.
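As a rough, Hadoop-free illustration of what counting documents involves, the sketch below scans TREC-style text for <DOC> records and pulls the identifier out of each <DOCNO> element (the value this package calls the docid). It is not the code of DemoCountTrecDocuments: the class name TrecDocidSketch, the regular expressions, and the sample records are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the real demo is a MapReduce job, not this class.
final class TrecDocidSketch {
    // One <DOC>...</DOC> record; DOTALL lets '.' span line breaks.
    private static final Pattern DOC =
        Pattern.compile("<DOC>(.*?)</DOC>", Pattern.DOTALL);
    // The identifier inside <DOCNO>, with surrounding whitespace trimmed.
    private static final Pattern DOCNO =
        Pattern.compile("<DOCNO>\\s*(.*?)\\s*</DOCNO>", Pattern.DOTALL);

    // Returns the docid of every record that has a <DOCNO> element.
    static List<String> extractDocids(String text) {
        List<String> ids = new ArrayList<>();
        Matcher doc = DOC.matcher(text);
        while (doc.find()) {
            Matcher id = DOCNO.matcher(doc.group(1));
            if (id.find()) {
                ids.add(id.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String sample =
            "<DOC>\n<DOCNO> LA123190-0134 </DOCNO>\n<TEXT>first</TEXT>\n</DOC>\n"
          + "<DOC>\n<DOCNO> FT911-3 </DOCNO>\n<TEXT>second</TEXT>\n</DOC>\n";
        List<String> ids = extractDocids(sample);
        System.out.println("documents: " + ids.size());
        for (String id : ids) {
            System.out.println(id);
        }
    }
}
```

In a MapReduce setting, the per-record work in extractDocids would sit in the mapper, with the docid emitted as the output key alongside its docno.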