Phonetic Coding Algorithms, Soundex and NYSIIS
Phonetic coding, generically referred to as "Soundex", is often used
to retrieve information from data processing systems. R. C. Russell
developed the Soundex algorithm to process data collected from the
1890 census. Known as the Russell Soundex algorithm, it has spawned
numerous variants that have been employed for genealogy studies and
retrieval systems.
New York State Identification and Intelligence Algorithm (NYSIIS)
In 1970, the New York State Identification and Intelligence project,
headed by Robert L. Taft, published the paper "Name Search Techniques".
The paper compared Soundex with a new phonetic routine
(NYSIIS) that was designed through rigorous empirical analysis.
The NYSIIS project concluded that:
NYSIIS is 98.72% accurate with a selectivity factor of 0.164%
per name inquiry.
Soundex is 95.99% accurate with a selectivity factor of 0.213%
per name inquiry.
Selectivity is defined as the number of records returned divided by
the size of the data set, so a lower figure means fewer candidate
records per inquiry.
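To make the selectivity figures concrete, here is a minimal Python sketch
that reads selectivity as records returned divided by data set size,
expressed as a percentage. The function names and the one-million-record
data set are illustrative assumptions and do not come from the NYSIIS
study; only the two percentages are taken from the text above.

```python
def selectivity_percent(records_returned, dataset_size):
    """Selectivity as a percentage: records returned / data set size."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return 100.0 * records_returned / dataset_size


def expected_records(selectivity_pct, dataset_size):
    """Average number of records one inquiry would return at this selectivity."""
    return round(selectivity_pct / 100.0 * dataset_size)


# Hypothetical data set size, chosen only to illustrate the arithmetic.
HYPOTHETICAL_DATASET_SIZE = 1_000_000

nysiis_hits = expected_records(0.164, HYPOTHETICAL_DATASET_SIZE)
soundex_hits = expected_records(0.213, HYPOTHETICAL_DATASET_SIZE)

print("NYSIIS :", nysiis_hits, "records per inquiry")
print("Soundex:", soundex_hits, "records per inquiry")
```

On a data set of that assumed size, the reported figures would correspond
to roughly 1,640 records per inquiry for NYSIIS and 2,130 for Soundex.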
* In 1998, the New York State Division of Criminal Justice, the agency
responsible for the NYSIIS project, replaced the NYSIIS engine with
NameSearch®. NameSearch's intelligent phonetic routines have been
proven to increase accuracy while decreasing selectivity compared
with NYSIIS.
Traditional solutions for name variations, such as Soundex and NYSIIS,
deal only with phonetic errors. These solutions standardized easily
confused sounds. For example, PH would be treated as F. Linguistic
rules were generated to phonetically tokenize a name, and the
phonetically tokenized words served as the basis for name retrieval.
In some instances these rules helped find names that were hard to
spell. Unfortunately, the distribution pattern of common names became
even more skewed. For example, inquiries on John also returned
Joan, Jim, Jane, Jimmy, Jenn and other names that fell in the "JAN"
phonetic pattern. By aggravating the skew in the distribution of
names, both quality and performance were sacrificed.
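The over-matching described above can be reproduced with a short Python
sketch of classic American Soundex. This is the widely documented
textbook variant, shown only as an illustration; it is not the routine
that yields the "JAN" pattern quoted above, and it is not NameSearch's
method. The names soundex and CODES are illustrative choices.

```python
# Consonant groups of classic American Soundex, mapped to their digits.
CODES = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    prev = CODES.get(letters[0], "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so a later repeat is coded again
    return ("".join(result) + "000")[:4]  # pad with zeros, keep four characters


names = ["John", "Joan", "Jim", "Jane", "Jimmy", "Jenn"]
for n in names:
    print(n, soundex(n))
```

Under this variant all six names collapse to the single code J500, so an
inquiry on any one of them retrieves the others as well, which is the
kind of skew the paragraph above describes.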
Discrepancies caused by phonetic errors account for twenty to twenty-five
percent of all name variations. Intelligent Search Technology addresses
problems due to phonetics by employing analysis routines to determine the
extent of phonetic tokenization. This enables NameSearch to overcome problems
due to phonetics without the negative consequences incurred with all other
methods of name search.
Additional information on phonetic encoding:
Problems caused by phonetic skewed distribution
Phonetic coding: NYSIIS vs. Soundex
NameSearch® General Information