Command-line tools for {@linkplain it.unimi.di.archive4j.Archive archive} construction.
The classes in this package contain a main method, and can be used to build archives starting from a {@linkplain it.unimi.di.big.mg4j.document.DocumentSequence document sequence}.
The first thing to do is to expose your data as an MG4J {@link it.unimi.di.big.mg4j.document.DocumentSequence}; in general, many details of the archive construction process are similar to those of the MG4J index construction process, so we suggest familiarising yourself with the documentation contained in {@link it.unimi.di.big.mg4j.document}. As an example, we will use a simple {@link it.unimi.di.big.mg4j.document.FileSetDocumentCollection}, which treats each file of a set as a document (words by default are maximal subsequences of alphanumeric characters). We assume that your Javadoc files are in /usr/share/javadoc, and create a serialised collection:
find /usr/share/javadoc/ -iname \*.html -type f | \
egrep -v "(package-|-tree|class-use|index-.*.html|allclasses)" | \
java it.unimi.di.big.mg4j.document.FileSetDocumentCollection \
    -f HtmlDocumentFactory -p encoding=UTF-8 javadoc.collection
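The egrep filter above discards Javadoc navigation pages (indices, trees, use pages) and keeps only actual class pages. You can check its effect on a small synthetic directory (the file names below are invented for illustration):

```shell
# Create a throwaway directory mimicking a Javadoc tree (hypothetical names).
mkdir -p /tmp/javadoc-demo
touch /tmp/javadoc-demo/Foo.html /tmp/javadoc-demo/package-summary.html \
      /tmp/javadoc-demo/overview-tree.html /tmp/javadoc-demo/allclasses-frame.html

# The same filter used in the pipeline above: only the class page survives.
find /tmp/javadoc-demo -iname \*.html -type f | \
  egrep -v "(package-|-tree|class-use|index-.*.html|allclasses)"
```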
The -p encoding=UTF-8 option passes an encoding to the {@link it.unimi.di.big.mg4j.document.HtmlDocumentFactory} (the properties you can set depend on the chosen factory).
At this point, all you have to do is to invoke {@link it.unimi.di.archive4j.tool.ArchiveBuilder}:
java -Xmx256M -server it.unimi.di.archive4j.tool.ArchiveBuilder \
    -S javadoc.collection -Itext basename
The -Itext option specifies that the text field of the {@link it.unimi.di.big.mg4j.document.HtmlDocumentFactory} processing your collection should be indexed (there is also a title field). There are many more options, which you can examine using the online help.
Archives can be built in a distributed fashion, and then combined. To do so, however, you must run the three archive construction phases manually. As an example, we suppose you have two input files containing one document per line. In this case, we can use the built-in {@link it.unimi.di.big.mg4j.document.InputStreamDocumentSequence}, which reads documents from standard input. Properties are specified directly to the tools running the archive construction phases.
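For instance, two such input files (one document per line; the contents below are invented purely for illustration) can be prepared as follows:

```shell
# Each line of the file is a separate document for InputStreamDocumentSequence.
printf 'first document of the first file\nsecond document of the first file\n' > your-input-file0
printf 'only document of the second file\n' > your-input-file1

# Counting lines counts documents: your-input-file0 contains two documents.
wc -l < your-input-file0
```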
First, you must {@link it.unimi.di.archive4j.tool.Preprocess} your files (for the sake of simplicity, we assume they are both in the same directory, but of course you can run the two following commands in a distributed way):
java -Xmx256M -server it.unimi.di.archive4j.tool.Preprocess \
    -Itext -p encoding=UTF-8 basename0 <your-input-file0
java -Xmx256M -server it.unimi.di.archive4j.tool.Preprocess \
    -Itext -p encoding=UTF-8 basename1 <your-input-file1
Now you have to move all generated files to a single location, and {@linkplain it.unimi.di.archive4j.tool.MergePreprocessedData merge} them:
java -Xmx256M -server it.unimi.di.archive4j.tool.MergePreprocessedData \
    basename basename0 basename1
In this phase you can also reduce the term set using various options. The resulting global data files have names stemmed from basename, and must be used to perform the actual {@link it.unimi.di.archive4j.tool.Scan}:
java -Xmx256M -server it.unimi.di.archive4j.tool.Scan \
    -Itext -r -p encoding=UTF-8 basename0 basename <your-input-file0
java -Xmx256M -server it.unimi.di.archive4j.tool.Scan \
    -Itext -r -p encoding=UTF-8 basename1 basename <your-input-file1
It is your responsibility to provide the same document-sequence options in the preprocessing and scanning phases. Note that we are passing the optional parameter basename, which causes the global data to be used for archive construction. The -r option requests a random-access archive.
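A simple way to keep the two phases from drifting apart is to store the shared document-sequence options in a single shell variable and build both command lines from it (a sketch; basenames and file names follow the example above, and the commands are only echoed as a dry run):

```shell
# Document-sequence options that Preprocess and Scan must see identically.
SEQ_OPTS="-Itext -p encoding=UTF-8"

# Dry run: print the two command lines built from the same variable.
echo java -Xmx256M -server it.unimi.di.archive4j.tool.Preprocess $SEQ_OPTS basename0
echo java -Xmx256M -server it.unimi.di.archive4j.tool.Scan -r $SEQ_OPTS basename0 basename
```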
At the end, you have two partial archives that must be {@linkplain it.unimi.di.archive4j.tool.MergeSortedArchives merged}:
java -Xmx256M -server it.unimi.di.archive4j.tool.MergeSortedArchives \
    -C basename basename0 basename1
The -C option tells the tool to concatenate the archives: document identifiers will be renumbered sequentially. It is also possible to specify explicit document identifiers using a map from {@link java.net.URI}s to integers; in that case, the resulting archives must simply be merged (i.e., without the -C option).
Warning: we are using the same basename for both merging phases. This is essential, as {@link it.unimi.di.archive4j.tool.MergePreprocessedData} computes some global statistics, such as term frequencies, which must be associated with the archive basename.