Use Commons Codec's Soundex. Supply a surname or a word, and Soundex will produce a phonetic encoding:
    // Required import declaration
    import org.apache.commons.codec.language.Soundex;

    // Code body
    Soundex soundex = new Soundex( );
    String obrienSoundex = soundex.soundex( "O'Brien" );
    String obrianSoundex = soundex.soundex( "O'Brian" );
    String obryanSoundex = soundex.soundex( "O'Bryan" );

    System.out.println( "O'Brien soundex: " + obrienSoundex );
    System.out.println( "O'Brian soundex: " + obrianSoundex );
    System.out.println( "O'Bryan soundex: " + obryanSoundex );
This will produce the following output for three similar surnames:
    O'Brien soundex: O165
    O'Brian soundex: O165
    O'Bryan soundex: O165
Soundex.soundex( ) takes a string, keeps the first letter as a letter code, and then calculates the rest of the code from the consonants that follow. Names such as "O'Bryan," "O'Brien," and "O'Brian," all common variants of the same Irish surname, therefore receive the same encoding: "O165." The 1 corresponds to the B, the 6 corresponds to the R, and the 5 corresponds to the N; vowels are not encoded, so they contribute no digit to the Soundex code.
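To make the encoding steps concrete, here is a minimal sketch of the classic American Soundex rules in plain Java. It is an illustration only, not the Commons Codec source: the class name SimpleSoundex and its encode method are hypothetical, and the sketch assumes the usual conventions (a four-character code padded with zeros, adjacent letters with the same digit collapsed, H and W skipped).

```java
// Simplified American Soundex for illustration; SimpleSoundex is a
// hypothetical name and this is not the Commons Codec implementation.
class SimpleSoundex {

    // One entry per letter A..Z. '0' marks vowels and Y (no digit emitted),
    // '-' marks H and W (skipped without breaking a run of equal digits).
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String input) {
        // Keep ASCII letters only, uppercased; apostrophes and spaces drop out.
        StringBuilder letters = new StringBuilder();
        for (char c : input.toCharArray()) {
            char u = Character.toUpperCase(c);
            if (u >= 'A' && u <= 'Z') {
                letters.append(u);
            }
        }
        if (letters.length() == 0) {
            return "";
        }

        // The first letter is kept as-is; its digit still suppresses a
        // following letter with the same digit (e.g. the F in "Pfister").
        StringBuilder out = new StringBuilder();
        out.append(letters.charAt(0));
        char last = CODES.charAt(letters.charAt(0) - 'A');

        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char code = CODES.charAt(letters.charAt(i) - 'A');
            if (code == '-') {
                continue; // H or W: ignore, and leave 'last' unchanged
            }
            if (code != '0' && code != last) {
                out.append(code); // a new consonant sound
            }
            last = code; // a vowel resets 'last', so repeats across vowels count
        }

        // Pad short codes with zeros to the fixed width of four characters.
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("O'Brien")); // prints O165
        System.out.println(encode("O'Bryan")); // prints O165
        System.out.println(encode("Boswell")); // prints B240
    }
}
```

In application code you would normally call Commons Codec's Soundex rather than maintain a hand-written encoder like this one; the sketch is only meant to show where each character of "O165" comes from.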
The Soundex algorithm can be used in a number of situations, but Soundex is usually associated with surnames, as the United States historical census records are indexed using Soundex. In addition to the role Soundex plays in the census, Soundex is also used in the health care industry to index medical records and report statistics to the government. A system that provides access to individual records should allow a user to search for a person by the Soundex code of a surname. If a user types in the name "Boswell" to search for a patient in a hospital, the search result should include patients named "Buswell" and "Baswol"; you can use Soundex to provide this capability if an application needs to locate individuals by the sound of a surname.
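One way to support such a search is to index each record under the phonetic code of its surname and to encode the query the same way before looking it up. The following is a rough sketch under stated assumptions: SurnameIndex and standInSoundex are hypothetical names, and the small built-in encoder exists only so the example runs without Commons Codec on the classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical phonetic index: surnames are grouped by the code an encoder
// produces, so a lookup returns every stored name that shares that code.
class SurnameIndex {
    private final UnaryOperator<String> encoder;
    private final Map<String, List<String>> byCode = new HashMap<>();

    SurnameIndex(UnaryOperator<String> encoder) {
        this.encoder = encoder;
    }

    void add(String surname) {
        byCode.computeIfAbsent(encoder.apply(surname), k -> new ArrayList<>())
              .add(surname);
    }

    List<String> search(String query) {
        List<String> hits = byCode.get(encoder.apply(query));
        return hits == null ? Collections.<String>emptyList() : hits;
    }

    // Stand-in encoder (an assumption made for this sketch) that follows the
    // classic Soundex rules: first letter kept, consonants mapped to digits,
    // vowels not coded, H and W skipped, result padded to four characters.
    static String standInSoundex(String name) {
        String codes = "0123012-02245501262301-202"; // A..Z
        StringBuilder out = new StringBuilder();
        char last = 0;
        for (char raw : name.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') continue;          // drop non-letters
            char code = codes.charAt(c - 'A');
            if (out.length() == 0) {                   // first letter
                out.append(c);
                last = code;
                continue;
            }
            if (code == '-') continue;                 // H or W
            if (code != '0' && code != last && out.length() < 4) {
                out.append(code);
            }
            last = code;
        }
        while (out.length() > 0 && out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        SurnameIndex index = new SurnameIndex(SurnameIndex::standInSoundex);
        for (String name : new String[] {"Buswell", "Baswol", "Bosworth", "Smith"}) {
            index.add(name);
        }
        System.out.println(index.search("Boswell")); // prints [Buswell, Baswol]
    }
}
```

With Commons Codec available, the stand-in could presumably be replaced by a method reference to the library's encoder, for example new SurnameIndex( new Soundex( )::soundex ), since soundex( ) takes a String and returns a String.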
The Soundex of a word or name can also be used as a primitive method to find out if two small words rhyme. Commons Codec contains other phonetic encodings, such as RefinedSoundex, Metaphone, and DoubleMetaphone. All of these alternatives solve similar problems: capturing the phonemes or sounds contained in a word.
For more information on the Soundex encoding, take a look at the Dictionary of Algorithms and Data Structures at the National Institute of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/soundex.html. There you will find links to a C implementation of the Soundex algorithm.
For more information about alternatives to Soundex encoding, read "The Double Metaphone Search Algorithm" by Lawrence Philips (http://www.cuj.com/documents/s=8038/cuj0006philips/). Or take a look at one of Lawrence Philips's original Metaphone algorithm implementations at http://aspell.sourceforge.net/metaphone/. Both the Metaphone and Double Metaphone algorithms capture the sound of an English word; implementations of these algorithms are available in Commons Codec as Metaphone and DoubleMetaphone.