Use Apache Lucene to index your documents and obtain a TermEnum using an IndexReader. The frequency of a term is defined as the number of documents in which a specific term appears, and a TermEnum object contains the frequency of every term in a set of documents. Example 12-3 iterates over the terms contained in a TermEnum, printing every term that appears in at least 1,100 speeches.
Example 12-3. TermFreq finding the most frequent terms in an index
package com.discursive.jccook.xml.bardsearch;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.builder.CompareToBuilder;
import org.apache.log4j.Logger;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

import com.discursive.jccook.util.LogInit;

public class TermFreq {

    private static Logger logger = Logger.getLogger(TermFreq.class);

    static {
        LogInit.init();
    }

    public static void main(String[] pArgs) throws Exception {
        logger.info("Threshold is 1100");
        int threshold = 1100;

        IndexReader reader = IndexReader.open("index");
        TermEnum termEnum = reader.terms();

        // Collect every term in the "speech" field that appears in at least
        // 'threshold' documents.
        List termList = new ArrayList();
        while (termEnum.next()) {
            if (termEnum.docFreq() >= threshold
                    && termEnum.term().field().equals("speech")) {
                Freq freq = new Freq(termEnum.term().text(), termEnum.docFreq());
                termList.add(freq);
            }
        }
        termEnum.close();
        reader.close();

        // Sort ascending by frequency, then reverse so the most frequent
        // terms come first.
        Collections.sort(termList);
        Collections.reverse(termList);

        System.out.println("Frequency | Term");
        Iterator iterator = termList.iterator();
        while (iterator.hasNext()) {
            Freq freq = (Freq) iterator.next();
            System.out.print(freq.frequency);
            System.out.println(" | " + freq.term);
        }
    }

    public static class Freq implements Comparable {
        String term;
        int frequency;

        public Freq(String term, int frequency) {
            this.term = term;
            this.frequency = frequency;
        }

        public int compareTo(Object o) {
            if (o instanceof Freq) {
                Freq oFreq = (Freq) o;
                return new CompareToBuilder()
                        .append(frequency, oFreq.frequency)
                        .append(term, oFreq.term)
                        .toComparison();
            } else {
                return 0;
            }
        }
    }
}
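Example 12-3 assumes that an index named index already exists in the working directory. As a rough sketch, such an index can be built with an IndexWriter and a SimpleAnalyzer; the class name BuildSpeechIndex and the hard-coded speech text are placeholders, but the "speech" field name and the SimpleAnalyzer match the discussion that follows.

package com.discursive.jccook.xml.bardsearch;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BuildSpeechIndex {
    public static void main(String[] args) throws Exception {
        // Create (or overwrite) the "index" directory using a SimpleAnalyzer,
        // which splits on letters and lowercases but discards nothing.
        IndexWriter writer = new IndexWriter("index", new SimpleAnalyzer(), true);

        // One document per speech; the text here stands in for real play content.
        Document doc = new Document();
        doc.add(Field.Text("speech", "The quality of mercy is not strain'd"));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }
}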
A Lucene index is opened by passing the name of the index directory to IndexReader.open(), and a TermEnum is retrieved from the IndexReader with a call to reader.terms(). The previous example iterates through every term contained in the TermEnum, creating and populating an instance of the inner class Freq if a term appears in at least 1,100 documents and occurs in the "speech" field. TermEnum contains three methods of interest: next(), docFreq(), and term(). next() advances to the next term in the TermEnum, returning false if no more terms are available; docFreq() returns the number of documents a term appears in; and term() returns a Term object containing the text of the term and the field the term occurs in. The List of Freq objects is sorted by frequency and reversed, and the most frequent terms in a set of Shakespeare plays are printed to the console:
0 INFO  [main] TermFreq - Threshold is 1100
Frequency | Term
2907 | i
2823 | the
2647 | and
2362 | to
2186 | you
1950 | of
1885 | a
1870 | my
1680 | is
1678 | that
1564 | in
1562 | not
1410 | it
1262 | s
1247 | me
1200 | for
1168 | be
1124 | this
1109 | but
From this list, it appears that the most frequent terms in Shakespeare plays are inconsequential words, such as "the," "a," "of," and "be." The index this example was executed against was created with a SimpleAnalyzer, which does not discard any terms. If the index is created with a StandardAnalyzer instead, common English stop words such as "the," "and," and "of" are not stored as terms in the index, so they do not show up in the most frequent terms list. Running this example against an index created with a StandardAnalyzer, and reducing the frequency threshold to 600 documents, returns the following results (a sketch of building such an index follows the output):
Frequency | Term
2727 | i
2153 | you
1862 | my
1234 | me
1091 | your
1057 | have
1027 | he
973 | what
921 | so
893 | his
824 | do
814 | him
693 | all
647 | thou
632 | shall
614 | lord
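To produce this second set of results, the only change needed when building the index is the analyzer passed to the IndexWriter. A minimal sketch, reusing the hypothetical BuildSpeechIndex structure from earlier:

package com.discursive.jccook.xml.bardsearch;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BuildSpeechIndexStandard {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer discards a default list of English stop words
        // ("the", "and", "of", and so on), so those terms never reach the index.
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        // As before, the text is a placeholder for real play content.
        Document doc = new Document();
        doc.add(Field.Text("speech", "The quality of mercy is not strain'd"));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }
}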