You need to quickly search a collection of XML documents, and, to do this, you need to create an index of terms that keeps track of the context in which each term appears.
Use Apache Lucene and Commons Digester to create an index of Lucene Document objects at the lowest level of granularity you wish to search. For example, if you are attempting to search for speeches in a Shakespeare play that contain specific terms, create a Lucene Document object for each speech. For the purposes of this recipe, assume that you are attempting to index Shakespeare plays stored in the following XML format:
<?xml version="1.0"?>
<PLAY>
  <TITLE>All's Well That Ends Well</TITLE>
  <ACT>
    <TITLE>ACT I</TITLE>
    <SCENE>
      <TITLE>SCENE I. Rousillon. The COUNT's palace.</TITLE>
      <SPEECH>
        <SPEAKER>COUNTESS</SPEAKER>
        <LINE>In delivering my son from me, I bury a second husband.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>BERTRAM</SPEAKER>
        <LINE>And I in going, madam, weep o'er my father's death</LINE>
        <LINE>anew: but I must attend his majesty's command, to</LINE>
        <LINE>whom I am now in ward, evermore in subjection.</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
The following code creates a Lucene index of Shakespeare speeches, reading XML files for each play in the ./data/shakespeare directory and calling the PlayIndexer to create Lucene Document objects for every speech. These Document objects are then written to a Lucene index using an IndexWriter:
import java.io.File;
import java.io.FilenameFilter;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.oro.io.GlobFilenameFilter;

File dataDir = new File("./data/shakespeare");
logger.info( "Looking for XML files in " + dataDir.getAbsolutePath( ) );
FilenameFilter xmlFilter = new GlobFilenameFilter( "*.xml" );
File[] xmlFiles = dataDir.listFiles( xmlFilter );

logger.info( "Creating Index" );
IndexWriter writer = new IndexWriter("index", new SimpleAnalyzer( ), true);
PlayIndexer playIndexer = new PlayIndexer( writer );
playIndexer.init( );

for (int i = 0; i < xmlFiles.length; i++) {
    playIndexer.index( xmlFiles[i] );
}

writer.optimize( );
writer.close( );
logger.info( "Parsing Complete, Index Created");
The PlayIndexer class, shown in Example 12-1, parses each XML file and creates Document objects that are written to an IndexWriter. The PlayIndexer uses Commons Digester to create a Lucene Document object for every speech. The init( ) method creates a Digester instance designed to interact with an inner class, DigestContext, which keeps track of the current context of a speech—play, act, scene, speaker—and the textual contents of a speech. At the end of every speech element, the DigestContext invokes the processSpeech( ) method, which creates a Lucene Document for each speech and writes this Document to the Lucene IndexWriter. Because each Document is associated with the specific context of a speech, it will be possible to obtain a specific location for each term or phrase.
Example 12-1. PlayIndexer using Commons Digester and Apache Lucene
package com.discursive.jccook.xml.bardsearch;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.net.URL;

import org.apache.commons.digester.Digester;
import org.apache.commons.digester.xmlrules.DigesterLoader;
import org.apache.log4j.Logger;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.xml.sax.SAXException;

import com.discursive.jccook.util.LogInit;

public class PlayIndexer {

    private static Logger logger = Logger.getLogger( PlayIndexer.class );
    static { LogInit.init( ); }

    private IndexWriter indexWriter;
    private Digester digester;
    private DigestContext context;

    public PlayIndexer(IndexWriter pIndexWriter) {
        indexWriter = pIndexWriter;
    }

    public void init( ) {
        URL playRules = PlayIndexer.class.getResource("play-digester-rules.xml");
        digester = DigesterLoader.createDigester( playRules );
    }

    public void index(File playXml) throws IOException, SAXException {
        context = new DigestContext( );
        digester.push( context );
        digester.parse( playXml );
        logger.info( "Parsed: " + playXml.getAbsolutePath( ) );
    }

    public void processSpeech( ) {
        Document doc = new Document( );
        doc.add(Field.Text("play", context.playTitle));
        doc.add(Field.Text("act", context.actTitle));
        doc.add(Field.Text("scene", context.sceneTitle));
        doc.add(Field.Text("speaker", context.speaker));
        doc.add(Field.Text("speech",
                new StringReader( context.speech.toString( ) )));
        try {
            indexWriter.addDocument( doc );
        } catch( IOException ioe ) {
            logger.error( "Unable to add document to index", ioe );
        }
    }

    public class DigestContext {
        File playXmlFile;
        String playTitle, actTitle, sceneTitle, speaker;
        StringBuffer speech = new StringBuffer( );

        public void setActTitle(String string) { actTitle = string; }
        public void setPlayTitle(String string) { playTitle = string; }
        public void setSceneTitle(String string) { sceneTitle = string; }
        public void setSpeaker(String string) { speaker = string; }
        public void appendLine(String pLine) { speech.append( pLine ); }

        public void speechEnd( ) {
            processSpeech( );
            speech.delete( 0, speech.length( ) );
            speaker = "";
        }
    }
}
Example 12-1 uses a Digester rule set defined in Example 12-2. This set of rules is designed to invoke a series of methods in a set sequence to populate the context variables for each speech. The Digester rules in Example 12-2 never push or pop objects onto the digester stack; instead, the Digester is used to populate variables and invoke methods on an object that creates Lucene Document objects based on a set of context variables. This example uses the Digester as a shorthand Simple API for XML (SAX) parser; the PlayIndexer contains a series of callback methods, and the Digester rule set simplifies the interaction between the underlying SAX parser and the DigestContext.
Example 12-2. Digester rules for PlayIndexer
<?xml version="1.0"?>
<digester-rules>
  <pattern value="PLAY">
    <bean-property-setter-rule pattern="TITLE" propertyname="playTitle"/>
    <pattern value="ACT">
      <bean-property-setter-rule pattern="TITLE" propertyname="actTitle"/>
      <pattern value="PROLOGUE">
        <bean-property-setter-rule pattern="TITLE" propertyname="sceneTitle"/>
        <pattern value="SPEECH">
          <bean-property-setter-rule pattern="SPEAKER" propertyname="speaker"/>
          <call-method-rule pattern="LINE" methodname="appendLine"
                            paramtype="java.lang.String" paramcount="0"/>
          <call-method-rule methodname="speechEnd" paramtype="java.lang.Object"/>
        </pattern>
      </pattern>
      <pattern value="SCENE">
        <bean-property-setter-rule pattern="TITLE" propertyname="sceneTitle"/>
        <pattern value="SPEECH">
          <bean-property-setter-rule pattern="SPEAKER" propertyname="speaker"/>
          <call-method-rule pattern="LINE" methodname="appendLine"
                            paramtype="java.lang.String" paramcount="0"/>
          <call-method-rule methodname="speechEnd" paramtype="java.lang.Object"/>
        </pattern>
      </pattern>
    </pattern>
  </pattern>
</digester-rules>
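To make concrete what the Digester rules automate, here is a sketch of roughly the same logic written directly against the JDK's SAX API: element callbacks populate context variables, and the end of every SPEECH element fires a callback that would, in the real recipe, build a Lucene Document. The class and method names here are our own illustration, not part of the recipe's code:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hand-rolled equivalent of the Digester rule set: track where we are in the
// document, collect context variables, and act at the end of each SPEECH.
public class SpeechHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>(); // open elements
    private final StringBuilder chars = new StringBuilder();
    private String playTitle, actTitle, sceneTitle, speaker;
    private final StringBuilder speech = new StringBuilder();
    final List<String> speeches = new ArrayList<>();

    @Override public void startElement(String uri, String local, String qName,
                                       Attributes atts) {
        path.push(qName);
        chars.setLength(0); // start collecting this element's text
    }

    @Override public void characters(char[] ch, int start, int len) {
        chars.append(ch, start, len);
    }

    @Override public void endElement(String uri, String local, String qName) {
        path.pop();
        String parent = path.peek();
        String text = chars.toString().trim();
        if ("TITLE".equals(qName)) {
            // Which context variable a TITLE sets depends on its parent,
            // just as the nested <pattern> elements express in Example 12-2.
            if ("PLAY".equals(parent))       playTitle  = text;
            else if ("ACT".equals(parent))   actTitle   = text;
            else if ("SCENE".equals(parent)) sceneTitle = text;
        } else if ("SPEAKER".equals(qName)) {
            speaker = text;
        } else if ("LINE".equals(qName)) {
            speech.append(text).append(' ');
        } else if ("SPEECH".equals(qName)) {
            // In the recipe, this is where processSpeech() builds a Document.
            speeches.add(speaker + ": " + speech.toString().trim());
            speech.setLength(0);
            speaker = "";
        }
    }

    public static List<String> parse(String xml) throws Exception {
        SpeechHandler handler = new SpeechHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        return handler.speeches;
    }
}
```

Comparing this sketch to Example 12-2 shows what the Digester buys you: the nested pattern/rule declarations replace all of the manual path tracking and string comparison above.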
In this recipe, an IndexWriter was created with a SimpleAnalyzer. An Analyzer breaks a stream of text into tokens and produces the terms to be indexed; different Analyzer implementations are appropriate for different applications. A SimpleAnalyzer keeps every term in a piece of text, discarding nothing. A StandardAnalyzer is an Analyzer that discards common English words with little semantic value, such as "the," "a," "an," and "for." The StandardAnalyzer maintains a list of terms to discard—a stop list.
Cutting down on the number of terms indexed can save time and space in an index, but it can also limit accuracy. For example, if one were to use the StandardAnalyzer to index the play Hamlet, a search for "to be or not to be" would return zero results, because every term in that phrase is a common English word on StandardAnalyzer's stop list. In this recipe, a SimpleAnalyzer is used because it keeps track of the occurrence of every term in a document.
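The effect of a stop list can be illustrated without Lucene at all. The following sketch is our own simplification, not Lucene's implementation: it tokenizes the way SimpleAnalyzer does (lowercase, split on non-letters) and then applies a small stop list the way StandardAnalyzer's stop filter does. The stop list here is only a subset of the real default list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrates the SimpleAnalyzer vs. StandardAnalyzer trade-off:
// keeping every term vs. dropping common English stop words.
public class StopListDemo {

    // A few entries from a typical English stop list (the real one is longer).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "be", "but", "for", "not", "or",
        "the", "to"));

    // Roughly what SimpleAnalyzer does: lowercase, split on non-letters,
    // keep every resulting term.
    static List<String> simpleTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // Roughly what StandardAnalyzer's stop filter adds: the same terms,
    // minus anything on the stop list.
    static List<String> standardTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : simpleTerms(text)) {
            if (!STOP_WORDS.contains(t)) terms.add(t);
        }
        return terms;
    }

    public static void main(String[] args) {
        String phrase = "To be, or not to be";
        System.out.println(simpleTerms(phrase));   // all six terms survive
        System.out.println(standardTerms(phrase)); // empty: all are stop words
    }
}
```

Running this shows why the Hamlet search fails under a stop-filtering analyzer: "to be or not to be" contributes no indexable terms at all, while a line like "I bury a second husband" loses only its stop words.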
What you end up with after running this example is a directory named index, which contains files used by Lucene to associate terms with documents. In this example, a Lucene Document consists of the contextual information fully describing each speech—"play," "act," "scene," "speaker," and "speech." Field objects are added to Document objects using Document's add( ) method. The processSpeech( ) method in PlayIndexer creates Lucene Document objects that contain Fields; Field objects are created by calling Text( ), a static method on Field. The first parameter to Text( ) is the name of the field, and the second parameter is the content to be indexed. Passing a String as the second parameter to Text( ) instructs the IndexWriter to store the content of the field in the Lucene index; a Field created with a String can be displayed in a search result. Passing a Reader as the second parameter to Text( ) instructs the IndexWriter not to store the contents of the field, and the contents of a field created with a Reader cannot be returned in a search result. In the previous example, the "speech" field is created with a Reader to reduce the size of the Lucene index, and every other Field is created with a String so that our search results can provide a speech's contextual coordinates.
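The stored-versus-unstored distinction can be modeled with a conceptual sketch. This is not Lucene code; the class and methods below are our own illustration of the idea that both kinds of field are searchable, but only stored field values can appear in a search result:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual model of one indexed document: stored field values are
// retrievable, indexed terms are searchable only.
public class FieldStorageDemo {
    final Map<String, String> storedFields = new HashMap<>(); // retrievable
    final Set<String> indexedTerms = new HashSet<>();         // searchable only

    // Analogous to Field.Text(name, String): indexed AND stored.
    void addStored(String name, String value) {
        storedFields.put(name, value);
        indexedTerms.addAll(terms(value));
    }

    // Analogous to Field.Text(name, Reader): indexed but NOT stored --
    // the terms survive, the original text is discarded.
    void addUnstored(String value) {
        indexedTerms.addAll(terms(value));
    }

    boolean matches(String term) {
        return indexedTerms.contains(term.toLowerCase());
    }

    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z']+"))
            if (!t.isEmpty()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        FieldStorageDemo doc = new FieldStorageDemo();
        doc.addStored("speaker", "COUNTESS");
        doc.addUnstored("In delivering my son from me, I bury a second husband.");

        System.out.println(doc.matches("husband"));          // searchable
        System.out.println(doc.storedFields.get("speaker")); // retrievable
        System.out.println(doc.storedFields.get("speech"));  // null: not stored
    }
}
```

In the same way, a hit on a "speech" term in this recipe's index can report which play, act, scene, and speaker it belongs to, but not echo back the speech text itself.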
Sure, you've created a Lucene index, but how would you search it? The index created in this recipe can be searched with Lucene using techniques described in Recipe 12.7 and Recipe 12.8.
If you are indexing a huge database of English documents, consider using the StandardAnalyzer to discard common English words. If you are indexing documents written in German or Russian, Lucene ships with GermanAnalyzer and RussianAnalyzer, which both contain stop word lists for these languages. For more information about these two implementations of Analyzer, see the Lucene JavaDoc at http://lucene.apache.org/java/1_9_1/api/index.html.
For more information about Apache Lucene, see the Lucene project web site at http://lucene.apache.org/.
This recipe uses The Plays of Shakespeare, compiled by Jon Bosak. To download the complete works of Shakespeare in XML format, see http://www.ibiblio.org/bosak/.