Common Java Cookbook

Edition: 0.19

12.7. Searching for a Specific Term in a Document Index

12.7.1. Problem

You need to identify which documents in a Lucene index contain specific terms or phrases.

12.7.2. Solution

Use an IndexSearcher to search a Lucene index created with IndexWriter. This recipe assumes that you have created a Lucene index using the techniques shown in the previous recipe. The constructor of IndexSearcher takes the name of a directory that contains a Lucene index. To create a Query object, pass a query String, a default search field, and an Analyzer to QueryParser.parse(). The following example searches the Lucene index created in the previous recipe for all speeches containing the term "Ophelia":

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// "logger" is assumed to be a logging object declared elsewhere in the class
logger.info("Searching for Ophelia");

// Open the Lucene index stored in the "index" directory
Searcher searcher = new IndexSearcher("index");

// Use the same Analyzer implementation that was used to build the index
Analyzer analyzer = new SimpleAnalyzer( );

// Parse the query; terms without a field qualifier search the "speech" field
Query query = QueryParser.parse("Ophelia", "speech", analyzer);
Hits hits = searcher.search(query);
logger.info( "Searching Done, hit: " + hits.length( ) );
System.out.println( "Score | Play | Act | Scene | Speaker" );

// Print one row per hit, with the score scaled to a whole number
for( int i = 0; i < hits.length( ); i++ ) {
    Document doc = hits.doc(i);
    System.out.print( (int) (hits.score(i) * 100 ) );
    System.out.print( " | " + doc.get("play") );
    System.out.print( " | " + doc.get("act") );
    System.out.print( " | " + doc.get("scene") );
    System.out.println( " | " + doc.get("speaker") );
}

An IndexSearcher is created by passing in the name of the directory containing the Lucene index to its constructor. Next, an Analyzer is created that will analyze the query String.

Warning

It is important at this stage to use the same Analyzer implementation that was used to create the Lucene index to be searched; in this case, a SimpleAnalyzer is used. If you parse your query with an Analyzer that discards the words "to," "be," "or," and "not," and then try to create a Query for "to be or not to be," you are not going to find the appropriate speech in Hamlet, because that Analyzer drops every term in the query.
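
The effect described in this warning can be pictured with a small sketch in plain Java. The class and method names below are hypothetical stand-ins, not Lucene classes, and the stop-word list is only an illustration; the point is simply that two different analysis strategies produce different terms from the same query text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins for two analysis strategies; these are not Lucene classes.
class AnalyzerMismatchSketch {

    // Words the stop-word strategy throws away (an illustrative list only).
    static final Set<String> STOP_WORDS = Set.of("to", "be", "or", "not");

    // Mimics a simple analyzer: lowercase the text and split on non-letters.
    static List<String> simpleTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Mimics a stop-word analyzer: the same tokens, minus the stop words.
    static List<String> stopFilteredTokens(String text) {
        List<String> tokens = simpleTokens(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String query = "to be or not to be";
        System.out.println(simpleTokens(query));       // [to, be, or, not, to, be]
        System.out.println(stopFilteredTokens(query)); // []
    }
}
```

With the stop-word strategy, nothing is left of the query to search for, which is why the index and the query should be analyzed the same way.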

QueryParser parses the query string and creates a Query object that will search the "speech" field of each Document in the index. The example then calls searcher.search( ) and iterates through the Document objects available from an instance of Hits. Hits gives access to each matching Document and to a relevance score for each result; a relevance score is a number between 0.00 and 1.00 that tells you how strongly a particular Document matches a particular query. The more often a term occurs in a speech, the more relevant the speech is, and the closer the relevance score is to 1. The previous example returns every speech containing the term "Ophelia" in every Shakespeare play, and, from the results, it is clear that Ophelia is a character in Hamlet. Every matching speech is listed with its relevance, play, act, scene, and speaker:

1   INFO [main] TermSearch     - Searching for Ophelia
321 INFO [main] TermSearch     - Searching Done, hit: 19
Score | Play   | Act     | Scene     | Speaker
100   | Hamlet | ACT IV  | SCENE V   | QUEEN GERTRUDE
100   | Hamlet | ACT IV  | SCENE V   | KING CLAUDIUS
81    | Hamlet | ACT IV  | SCENE V   | QUEEN GERTRUDE
81    | Hamlet | ACT V   | SCENE I   | HAMLET
58    | Hamlet | ACT I   | SCENE III | LORD POLONIUS
58    | Hamlet | ACT II  | SCENE I   | LORD POLONIUS
50    | Hamlet | ACT I   | SCENE III | LAERTES
33    | Hamlet | ACT V   | SCENE I   | HAMLET
25    | Hamlet | ACT III | SCENE I   | QUEEN GERTRUDE
24    | Hamlet | ACT III | SCENE I   | LORD POLONIUS
22    | Hamlet | ACT IV  | SCENE VII | LAERTES
21    | Hamlet | ACT III | SCENE I   | KING CLAUDIUS
17    | Hamlet | ACT IV  | SCENE V   | LAERTES
17    | Hamlet | ACT II  | SCENE II  | LORD POLONIUS
16    | Hamlet | ACT III | SCENE I   | LORD POLONIUS
14    | Hamlet | ACT II  | SCENE II  | LORD POLONIUS
13    | Hamlet | ACT I   | SCENE III | LORD POLONIUS
11    | Hamlet | ACT I   | SCENE III | LAERTES
11    | Hamlet | ACT III | SCENE I   | HAMLET
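
The scores in this listing can be pictured with a deliberately simplified model: count how often the term occurs in each text and divide by the highest count, so that the best match scores 1. This is only a sketch under that assumption; Lucene's actual scoring takes more factors into account. The class name and the sample texts are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// A deliberately simplified relevance model; Lucene's real scoring uses more factors.
class RelevanceSketch {

    // Counts how many tokens in the text equal the term, ignoring case.
    static int occurrences(String text, String term) {
        int count = 0;
        String wanted = term.toLowerCase(Locale.ROOT);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Scores each matching text between 0 and 1 by dividing by the highest count.
    static Map<String, Double> score(Map<String, String> texts, String term) {
        int max = 0;
        for (String text : texts.values()) {
            max = Math.max(max, occurrences(text, term));
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : texts.entrySet()) {
            int count = occurrences(entry.getValue(), term);
            if (count > 0) {
                scores.put(entry.getKey(), (double) count / max);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, String> texts = new LinkedHashMap<>();
        texts.put("A", "Ophelia, Ophelia, dear Ophelia"); // made-up sample text
        texts.put("B", "Where is Ophelia?");              // made-up sample text
        texts.put("C", "Something is rotten");            // no match, so no score
        for (Map.Entry<String, Double> e : score(texts, "Ophelia").entrySet()) {
            // Same presentation as the recipe: the score scaled to a whole number.
            System.out.println((int) (e.getValue() * 100) + " | " + e.getKey());
        }
        // prints "100 | A" and then "33 | B"
    }
}
```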

A Query can also search for multiple terms in a specific order because a Lucene index keeps track of the order relationships between terms within a Document. Searching for the famous "to be or not to be" returns a single match from Act III, Scene I of Hamlet:

0 [main] INFO TermSearch  - Searching for 'speech:"to be or not to be"'
354 [main] INFO TermSearch  - Searching Done, hit: 1
Score | Play   | Act     | Scene   | Speaker
100   | Hamlet | ACT III | SCENE I | HAMLET

This search was only possible because the SimpleAnalyzer was used during both index creation and query parsing. If an Analyzer that discards common words had been used to create the index, the Lucene index would not be storing information about words such as "to," "be," "or," and "not." It is a common practice for general search engines to discard very common terms such as "the," "a," "an," or "when." Discarding unimportant terms can reduce the size of an index remarkably, but if you need to search for "to be or not to be," you will need to preserve all terms.
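
Order-sensitive matching of this kind depends on the index recording where each term occurs. The following is a minimal sketch of that idea in plain Java, assuming a single text and a simple lowercase, letters-only tokenizer; the class and method names are hypothetical, and this is not how Lucene itself is implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of position-aware matching; not Lucene's implementation.
class PhraseMatchSketch {

    // Maps each term to the positions at which it occurs in one text.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                index.computeIfAbsent(token, k -> new ArrayList<>()).add(position++);
            }
        }
        return index;
    }

    // True when the terms of the phrase occur at consecutive positions.
    static boolean containsPhrase(Map<String, List<Integer>> index, String phrase) {
        String[] terms = phrase.toLowerCase(Locale.ROOT).trim().split("[^a-z]+");
        for (int start : index.getOrDefault(terms[0], List.of())) {
            boolean match = true;
            for (int i = 1; i < terms.length && match; i++) {
                match = index.getOrDefault(terms[i], List.of()).contains(start + i);
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
            positions("To be, or not to be: that is the question");
        System.out.println(containsPhrase(index, "to be or not to be")); // true
        System.out.println(containsPhrase(index, "not to be or"));       // false
    }
}
```

Every term of the second phrase is present in the text, but not in that order, so only the first phrase matches.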

12.7.3. Discussion

Both of the previous examples executed queries in about 300 milliseconds on a very cheap 2.0 GHz Celeron eMachine. This search would have taken orders of magnitude longer to execute if every document had to be parsed and searched in response to a query. The only reason a full-text search can be completed in a few hundred milliseconds is the presence of a Lucene index. The Lucene index provides a database of Documents indexed by term, and an IndexSearcher is essentially retrieving objects from this database.
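
The idea of "Documents indexed by term" can be sketched as an inverted index: a map from each term to the identifiers of the documents that contain it. The class below is a hypothetical illustration in plain Java, not Lucene's data structure; it only shows why a lookup does not need to reread every document.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index; Lucene's on-disk format is far richer.
class InvertedIndexSketch {

    // term -> sorted ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenizes the text once, at indexing time, and records each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Answers a single-term query with one map lookup; no text is rescanned.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Soft you now, the fair Ophelia"); // sample texts for illustration
        index.add(1, "Get thee to a nunnery");
        index.add(2, "I loved Ophelia");
        System.out.println(index.search("Ophelia")); // [0, 2]
    }
}
```

The cost of tokenizing is paid once when the index is built, which is the trade that makes the query itself fast.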

A Lucene query can combine multiple criteria, search for terms matching wildcards, and find documents by multiple fields. A specific field can be searched by prefixing a term with the field name and a colon; for example, to search for documents in a certain play, one would use the query play:"Hamlet". The second parameter to QueryParser.parse( ) is the default field for a query, and, in the previous example, the default field is "speech." This means that a term without a field qualifier will match the "speech" field. Table 12-1 lists some possible Lucene queries and describes the results they would return.
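
The role of the default field can be illustrated with a small sketch that resolves a single query clause into a field and a term. This is a hypothetical helper written for illustration, handling only one clause with an optional field prefix; it is not the logic QueryParser actually uses.

```java
// Hypothetical helper showing how a default field applies to an unqualified term.
class FieldTermSketch {

    // Returns { field, term }; a clause without "field:" falls back to defaultField.
    static String[] resolve(String clause, String defaultField) {
        int colon = clause.indexOf(':');
        String field = colon < 0 ? defaultField : clause.substring(0, colon);
        String term = colon < 0 ? clause : clause.substring(colon + 1);
        // Drop surrounding double quotes, as in play:"Hamlet".
        if (term.length() >= 2 && term.startsWith("\"") && term.endsWith("\"")) {
            term = term.substring(1, term.length() - 1);
        }
        return new String[] { field, term };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" -> ", resolve("Ophelia", "speech")));
        // prints "speech -> Ophelia"
        System.out.println(String.join(" -> ", resolve("play:\"Hamlet\"", "speech")));
        // prints "play -> Hamlet"
    }
}
```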

Table 12-1. A survey of Lucene queries

Query                                      | Description
play:"Hamlet"                              | Returns all documents with a "play" field matching the string "Hamlet"
"to be" AND "not to be"                    | Returns a document with a "speech" field containing the strings "to be" and "not to be"
play:"Hamlet" AND ("Polonius" OR "Hamlet") | Returns all documents with a "play" field matching "Hamlet" with a "speech" field that contains the terms "Polonius" or "Hamlet"
s*ings                                     | Returns all documents with a "speech" field containing a term that starts with "s" and ends in "ings"; includes terms such as "strings" and "slings"
L?ve                                       | Returns all documents with a "speech" field containing terms such as "Love" or "Live"
"slings" NOT "arrows"                      | Returns documents with a "speech" field that contains "slings" but not "arrows"
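
The two wildcard characters in the table can be illustrated by translating a wildcard term into a regular expression, where * stands for any run of characters (including none) and ? stands for exactly one character. The class below is a hypothetical sketch of those matching rules in plain Java; it is case-sensitive and says nothing about how Lucene evaluates wildcard queries internally.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of wildcard matching rules; not Lucene's implementation.
class WildcardSketch {

    // Translates '*' to "any run of characters" and '?' to "exactly one character".
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True when the whole term matches the wildcard expression.
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("s*ings", "slings"));  // true
        System.out.println(matches("s*ings", "strings")); // true
        System.out.println(matches("L?ve", "Love"));      // true
        System.out.println(matches("L?ve", "Leave"));     // false
    }
}
```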


The following Lucene query finds documents that contain either both "Saint Crispin" and "England" or both "to be or not to be" and "slings and arrows":

("Saint Crispin" AND "England") OR 
("to be or not to be" AND ("slings and arrows") )

When this query is executed against the Lucene index used in the previous two recipes, two speeches are returned: a rousing battle speech from Henry V and Hamlet's existential rant. Running this query would produce the following output:

0 [main] INFO TermSearch  - Searching for ("Saint Crispin" AND "England") OR 
("to be or not to be" AND ("slings and arrows") )
406 [main] INFO TermSearch  - Searching Done, hit: 2
Score | Play    | Act     | Scene                           | Speaker
31    | Hamlet  | ACT III | SCENE I.  A room in the castle. | HAMLET
11    | Henry V | ACT IV  | SCENE III.  The English camp.   | KING HENRY V
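
One way to picture how AND, OR, and NOT combine is as set operations on the documents that match each phrase: AND is an intersection, OR is a union, and NOT is a difference. The sketch below assumes that view; the class name and the document identifiers are made up for illustration, and it does not describe how Lucene executes a query.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: Boolean operators as set operations on matching document ids.
class BooleanQuerySketch {

    // AND keeps only the documents present in both sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR keeps the documents present in either set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT keeps the documents in the first set that are absent from the second.
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Made-up document ids for the speeches matching each phrase.
        Set<Integer> saintCrispin = Set.of(7);
        Set<Integer> england = Set.of(3, 7);
        Set<Integer> toBeOrNotToBe = Set.of(12);
        Set<Integer> slingsAndArrows = Set.of(12, 15);

        // ("Saint Crispin" AND "England") OR ("to be or not to be" AND "slings and arrows")
        System.out.println(or(and(saintCrispin, england),
                              and(toBeOrNotToBe, slingsAndArrows))); // [7, 12]
    }
}
```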

Creative Commons License
Common Java Cookbook by Tim O'Brien is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Permissions beyond the scope of this license may be available at http://www.discursive.com/books/cjcook/reference/jakartackbk-PREFACE-1.html. Copyright 2009. Common Java Cookbook Chunked HTML Output. Some Rights Reserved.