Use an IndexSearcher to search a Lucene index created with IndexWriter. This recipe assumes that you have created a Lucene index using the techniques shown in the previous recipe. The constructor of IndexSearcher takes the name of a directory that contains a Lucene index. A Query object can be created by passing a String query, a default search field, and an Analyzer to QueryParser.parse(). The following example searches the Lucene index created in the previous recipe for all speeches containing the term "Ophelia":
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    // logger is assumed to be defined elsewhere in the enclosing class
    logger.info("Searching for Ophelia");

    // Open the index directory and use the same Analyzer that built the index
    Searcher searcher = new IndexSearcher("index");
    Analyzer analyzer = new SimpleAnalyzer( );

    // Parse the query against the default field "speech" and run the search
    Query query = QueryParser.parse("Ophelia", "speech", analyzer);
    Hits hits = searcher.search(query);
    logger.info( "Searching Done, hit: " + hits.length( ) );

    // Print one row per hit, with the score shown as a percentage
    System.out.println( "Score | Play | Act | Scene | Speaker" );
    for( int i = 0; i < hits.length( ); i++ ) {
        Document doc = hits.doc(i);
        System.out.print( (int) (hits.score(i) * 100 ) );
        System.out.print( " | " + doc.get("play") );
        System.out.print( " | " + doc.get("act") );
        System.out.print( " | " + doc.get("scene") );
        System.out.print( " | " + doc.get("speaker") + "\n" );
    }
An IndexSearcher is created by passing the name of the directory containing the Lucene index to its constructor. Next, an Analyzer is created to analyze the query String. At this stage, it is important to use the same Analyzer implementation that was used to create the index being searched; in this case, that is SimpleAnalyzer. If you use an Analyzer that discards the words "to," "be," "or," and "not," and then try to create a Query for "to be or not to be," you will not find the appropriate speech in Hamlet, because the Analyzer used to parse your query dropped every term.
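To see why a stop-word-dropping Analyzer leaves nothing to search for, consider the following minimal sketch. It does not use Lucene; the class StopWordSketch, its analyze() method, and the four-word stop list are hypothetical stand-ins chosen only to mirror the example in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an Analyzer; this is not Lucene code.
class StopWordSketch {

    // Illustrative stop list (an assumption, not Lucene's actual list).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("to", "be", "or", "not"));

    // Lowercase the text, split on non-letters, and optionally drop stop words.
    static List<String> analyze(String text, boolean dropStopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            if (dropStopWords && STOP_WORDS.contains(token)) continue;
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Keeping every word leaves six terms to search for.
        System.out.println(analyze("To be, or not to be", false));
        // Dropping stop words leaves an empty list: nothing to search for.
        System.out.println(analyze("To be, or not to be", true));
    }
}
```

With the stop list enabled, the query is reduced to zero terms, which is the situation the paragraph above warns about.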
QueryParser parses the query string and creates a Query object that will search the "speech" field of each Document in the index. The example then calls searcher.search( ) and iterates through the Document objects contained in an instance of Hits. Hits contains a ranked list of Document objects and a relevance score for each result; a relevance score is a number between 0.00 and 1.00 that tells you how strongly a particular Document matches a particular query. The more often a term occurs in a speech, the more relevant the speech is, and the closer the relevance score is to 1. The previous example returns every speech in every Shakespeare play that contains the term "Ophelia," and, from the results, it is clear that Ophelia is a character in Hamlet. Each matching speech is listed with its relevance, play, act, scene, and speaker:
    1 INFO [main] TermSearch - Searching for Ophelia
    321 INFO [main] TermSearch - Searching Done, hit: 19
    Score | Play | Act | Scene | Speaker
    100 | Hamlet | ACT IV | SCENE V | QUEEN GERTRUDE
    100 | Hamlet | ACT IV | SCENE V | KING CLAUDIUS
    81 | Hamlet | ACT IV | SCENE V | QUEEN GERTRUDE
    81 | Hamlet | ACT V | SCENE I | HAMLET
    58 | Hamlet | ACT I | SCENE III | LORD POLONIUS
    58 | Hamlet | ACT II | SCENE I | LORD POLONIUS
    50 | Hamlet | ACT I | SCENE III | LAERTES
    33 | Hamlet | ACT V | SCENE I | HAMLET
    25 | Hamlet | ACT III | SCENE I | QUEEN GERTRUDE
    24 | Hamlet | ACT III | SCENE I | LORD POLONIUS
    22 | Hamlet | ACT IV | SCENE VII | LAERTES
    21 | Hamlet | ACT III | SCENE I | KING CLAUDIUS
    17 | Hamlet | ACT IV | SCENE V | LAERTES
    17 | Hamlet | ACT II | SCENE II | LORD POLONIUS
    16 | Hamlet | ACT III | SCENE I | LORD POLONIUS
    14 | Hamlet | ACT II | SCENE II | LORD POLONIUS
    13 | Hamlet | ACT I | SCENE III | LORD POLONIUS
    11 | Hamlet | ACT I | SCENE III | LAERTES
    11 | Hamlet | ACT III | SCENE I | HAMLET
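The idea that more occurrences of a term push a score toward 1 can be sketched in a few lines. The following is a deliberate simplification and not Lucene's scoring formula, which weighs additional factors; the class RelevanceSketch, its methods, and the sample texts are all made up for illustration. Each speech is scored as its term count divided by the largest count found.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical simplification of relevance scoring; this is not Lucene code.
class RelevanceSketch {

    // Count how many times term appears as a whole word in text.
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.equals(wanted)) count++;
        }
        return count;
    }

    // Score each speech as its term count divided by the largest count,
    // so the best match scores 1.0 and speeches without the term score 0.0.
    static Map<String, Double> score(Map<String, String> speeches, String term) {
        int max = 0;
        for (String text : speeches.values()) {
            max = Math.max(max, termFrequency(text, term));
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : speeches.entrySet()) {
            int tf = termFrequency(entry.getValue(), term);
            scores.put(entry.getKey(), max == 0 ? 0.0 : (double) tf / max);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up sample texts, not quotations from the plays.
        Map<String, String> speeches = new LinkedHashMap<>();
        speeches.put("first", "Ophelia walks here; call Ophelia back");
        speeches.put("second", "Where is Ophelia now");
        speeches.put("third", "No one is here");
        System.out.println(score(speeches, "Ophelia"));
    }
}
```

Dividing by the largest count is only one way to keep scores between 0 and 1; it is used here because it makes the relationship between term count and score easy to see.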
A Query can also search for multiple terms in a specific order because a Lucene index keeps track of the order relationships between terms within a Document. Searching for the famous "to be or not to be" returns a single match from Act III, Scene I of Hamlet:
    0 [main] INFO TermSearch - Searching for 'speech:"to be or not to be"'
    354 [main] INFO TermSearch - Searching Done, hit: 1
    Score | Play | Act | Scene | Speaker
    100 | Hamlet | ACT III | SCENE I | HAMLET
This search was possible only because SimpleAnalyzer, which keeps every word, was used during both index creation and query parsing. If an Analyzer that removes common words (Lucene's StopAnalyzer, for example) had been used to create the index, the Lucene index would not store information about such common words as "to," "be," "or," and "not." It is common practice for general search engines to discard very common terms such as "the," "a," "an," or "when." Discarding unimportant terms can reduce the size of an index remarkably, but if you need to search for "to be or not to be," you will need to preserve all terms.
Both of the previous examples executed queries in about 300 milliseconds on a very cheap 2.0 GHz Celeron eMachine. This search would have taken orders of magnitude longer to execute if every document had to be parsed and searched in response to a query. The only reason a full-text search can be completed in a few hundred milliseconds is the presence of a Lucene index. The Lucene index provides a database of Documents indexed by term, and an IndexSearcher is essentially retrieving objects from this database.
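A rough way to picture "a database of Documents indexed by term" is a map from each term to the documents and positions where it occurs. The following sketch is an assumption-laden toy and not Lucene's index format: the class InvertedIndexSketch and its methods are hypothetical. It shows why a term lookup avoids rescanning every document, and how stored positions make an ordered phrase search possible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index; Lucene's real index structure is far more elaborate.
class InvertedIndexSketch {

    // term -> (document id -> positions of the term within that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Lowercase the text and split on non-letters, keeping every word.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Record the position of every term in the document.
    void add(int docId, String text) {
        int position = 0;
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // Term lookup: one map access instead of a scan over every document.
    Set<Integer> search(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<>() : new TreeSet<>(docs.keySet());
    }

    // Phrase lookup: documents where the words occur at consecutive positions.
    Set<Integer> searchPhrase(String phrase) {
        List<String> words = tokenize(phrase);
        Set<Integer> result = new TreeSet<>();
        if (words.isEmpty()) return result;
        for (int docId : search(words.get(0))) {
            for (int start : postings.get(words.get(0)).get(docId)) {
                if (matchesAt(docId, start, words)) {
                    result.add(docId);
                    break;
                }
            }
        }
        return result;
    }

    // True if words 1..n appear right after the first word found at start.
    private boolean matchesAt(int docId, int start, List<String> words) {
        for (int i = 1; i < words.size(); i++) {
            Map<Integer, List<Integer>> docs = postings.get(words.get(i));
            if (docs == null || !docs.containsKey(docId)
                    || !docs.get(docId).contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "To be, or not to be: that is the question");
        index.add(2, "Not to be outdone, or so they say"); // made-up text
        System.out.println(index.search("question"));
        System.out.println(index.searchPhrase("to be or not to be"));
    }
}
```

Because positions are recorded for every word, including common ones, the phrase search only works when no terms were discarded at indexing time.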
A Lucene query can combine multiple criteria, search for terms matching wildcards, and find documents by multiple fields. A specific field can be searched by prefixing a term with the field name and a colon; for example, to search for documents in a certain play, one would use the query play:"Hamlet". The second parameter to QueryParser.parse( ) is the default field for a query, and, in the previous example, the default field is "speech." This means that a term without a field qualifier will match the "speech" field. Table 12-1 lists some possible Lucene queries and describes the results they would return.
Table 12-1. A survey of Lucene queries
Query | Description |
---|---|
play:"Hamlet" | Returns all documents with a "play" field matching the string "Hamlet" |
"to be" AND "not to be" | Returns a document with a "speech" field containing the strings "to be" and "not to be" |
play:"Hamlet" AND ("Polonius" OR "Hamlet") | Returns all documents with a "play" field matching "Hamlet" with a "speech" field that contains the terms "Polonius" or "Hamlet" |
s*ings | Returns all documents with a "speech" field containing a term that starts with "s" and ends in "ings"; includes terms such as "strings" and "slings" |
L?ve | Returns all documents with a "speech" field containing terms such as "Love" or "Live" |
"slings" NOT "arrows" | Returns documents with a "speech" field that contains "slings" but not "arrows" |
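The wildcard rows in Table 12-1 follow a simple rule: an asterisk stands for any run of characters (including none), and a question mark stands for exactly one character. The sketch below expresses that rule by translating a wildcard term into a regular expression. It illustrates the matching semantics only and is not how Lucene evaluates wildcard queries; the class WildcardSketch and its methods are hypothetical, and the case-insensitive matching is an assumption made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of wildcard semantics; this is not Lucene code.
class WildcardSketch {

    // Translate a wildcard term into a regular expression:
    // '*' matches any run of characters, '?' matches exactly one character.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Case-insensitive matching is an assumption made for this sketch.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // True if the whole term matches the wildcard expression.
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("s*ings", "slings"));
        System.out.println(matches("L?ve", "Love"));
        System.out.println(matches("L?ve", "Leave"));
    }
}
```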
The following Lucene query finds documents that contain either both "Saint Crispin" and "England," or both "to be or not to be" and "slings and arrows":
("Saint Crispin" AND "England") OR ("to be or not to be" AND ("slings and arrows") )
When this query is executed against the Lucene index used in the previous two recipes, two speeches are returned: a rousing battle speech from Henry V and Hamlet's existential rant. Running this query would produce the following output:
    0 [main] INFO TermSearch - Searching for ("Saint Crispin" AND "England") OR ("to be or not to be" AND ("slings and arrows") )
    406 [main] INFO TermSearch - Searching Done, hit: 2
    Score | Play | Act | Scene | Speaker
    31 | Hamlet | ACT III | SCENE I. A room in the castle. | HAMLET
    11 | Henry V | ACT IV | SCENE III. The English camp. | KING HENRY V
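One way to reason about a compound query like this is to treat each quoted phrase as the set of documents that contain it, and then combine those sets: AND is an intersection, OR is a union, and NOT is a difference. The sketch below follows that model. It is a conceptual aid rather than Lucene's implementation, and the class BooleanQuerySketch, its methods, and the document numbers are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical set-based model of boolean queries; this is not Lucene code.
class BooleanQuerySketch {

    // AND: documents present in both result sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR: documents present in either result set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT: documents in the first result set but not in the second.
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Convenience helper for building a set of document ids.
    static Set<Integer> ids(Integer... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        // Made-up document ids standing in for the speeches matched by each phrase.
        Set<Integer> saintCrispin = ids(7);
        Set<Integer> england = ids(7, 8, 9);
        Set<Integer> toBe = ids(42);
        Set<Integer> slingsAndArrows = ids(42, 43);

        // ("Saint Crispin" AND "England") OR ("to be or not to be" AND "slings and arrows")
        System.out.println(or(and(saintCrispin, england), and(toBe, slingsAndArrows)));
    }
}
```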