Phonetic coding, generically referred to as "Soundex", is
often used to enable retrieval of information from data processing systems.
R. C. Russell developed the Soundex algorithm to process data collected
from the 1890 census. Known as the Russell Soundex algorithm, it has spawned
numerous variants that have been employed for genealogy studies and retrieval systems.
New York State Identification and Intelligence Algorithm (NYSIIS)
In 1970 the New York State Identification and Intelligence project, headed by
Robert L. Taft, published the paper "Name Search Techniques". In
this paper he compared Soundex with a new phonetic routine (NYSIIS) that
was designed through rigorous empirical analysis.
The NYSIIS project concluded that:
NYSIIS is 98.72% accurate with a selectivity factor of 0.164% per name inquiry.
Soundex is 95.99% accurate with a selectivity factor of 0.213% per name inquiry.
Selectivity is defined as the number of records returned divided by the size of the
data set.
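To make the selectivity figures concrete, here is a minimal Python sketch of the ratio described above. The function names and the one-million-record data set are assumptions chosen for illustration; they do not come from the NYSIIS paper.

```python
def selectivity(records_returned: int, data_set_size: int) -> float:
    """Fraction of the data set returned by a single name inquiry."""
    if data_set_size <= 0:
        raise ValueError("data_set_size must be positive")
    return records_returned / data_set_size


def expected_records(selectivity_percent: float, data_set_size: int) -> float:
    """Average number of records one inquiry returns at a given selectivity."""
    return selectivity_percent / 100 * data_set_size


# Hypothetical data set size, used only to illustrate the reported factors.
HYPOTHETICAL_SIZE = 1_000_000

nysiis_hits = expected_records(0.164, HYPOTHETICAL_SIZE)
soundex_hits = expected_records(0.213, HYPOTHETICAL_SIZE)
print(f"NYSIIS:  about {nysiis_hits:,.0f} records per inquiry")
print(f"Soundex: about {soundex_hits:,.0f} records per inquiry")
```

On a data set of that assumed size, the reported factors work out to roughly 1,640 records per inquiry for NYSIIS and roughly 2,130 for Soundex, so a lower selectivity factor means fewer candidate records to review.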
* In 1998 the New York State Division of Criminal Justice,
the agency responsible for the NYSIIS project, replaced the NYSIIS
engine with NameSearch®.
NameSearch's intelligent phonetic routines have been proven to
increase accuracy while decreasing selectivity as compared to NYSIIS.
Traditional solutions for name variations, such as Soundex and NYSIIS,
deal only with phonetic errors. These solutions involved the standardization
of easily confused sounds. For example, PH's would be treated as F's. Linguistic
rules were generated to phonetically tokenize a name. These phonetically tokenized
words served as the basis for name retrieval. In some instances these rules helped
find names that were hard to spell. Unfortunately, the distribution pattern
of common names became even more skewed. For example, inquiries on John also
returned Joan, Jim, Jane, Jimmy, Jenn and other names which fell in the "JAN" phonetic
pattern. By aggravating the skew in the distribution of names, both quality and performance
were sacrificed.
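The skew described above can be reproduced with a short Python sketch. It assumes the commonly documented American Soundex rules (first letter kept, consonants mapped to six digit classes, vowels dropped, code padded to four characters). It is an illustration only, not the NYSIIS routine or the NameSearch engine, and the function name soundex is chosen here for the example.

```python
# Digit classes of the commonly documented American Soundex variant.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" for an empty name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded sound.
        if digit and digit != prev:
            code += digit
        # H and W do not separate repeated sounds; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


names = ["John", "Joan", "Jim", "Jane", "Jimmy", "Jenn"]
for name in names:
    print(name, soundex(name))
```

Under these assumed rules every name in the list receives the same code, J500, so an inquiry on any one of them retrieves all of the others. Common names that share a code in this way are what make the distribution skewed.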
Discrepancies caused by phonetic errors account for twenty to
twenty-five percent of all name variations. Intelligent Search Technology
addresses problems due to phonetics by employing analysis routines
to determine the extent of phonetic tokenization. This enables NameSearch
to overcome problems due to phonetics without the negative consequences
incurred with all other methods of name search.
Additional information on phonetic encoding:
Problems caused by phonetically skewed distribution
Phonetic coding: NYSIIS vs. Soundex
NameSearch® General Information