Names and addresses follow a heavily skewed distribution. A few words
account for the majority of names, while a very large number of uncommon
names exist but each occurs infrequently. This is most dramatically
illustrated by the analysis of people's names in the United States.
Although there are 2.5 million last names and over 3.2 million first
names, three hundred surnames account for thirty-five percent of the
population, and over sixty-five percent of the population has one of
four hundred first names. The skew in company names and street addresses
is just as extreme. Inquiries usually follow a distribution pattern
similar to that of the names in the database. The problem is complicated
further because name frequencies vary with geographical location and
with the type of information stored in the database.

Phonetic tokenization increases the skew toward common names. For
example, inquiries on John also return Joan, Jim, Jane, Jimmy, Jenn
and other names that fall into the "JAN" phonetic pattern. Aggravating
the skew in the distribution of names sacrifices both quality and
performance.
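
The collapsing effect can be reproduced with any phonetic code. The
sketch below uses American Soundex purely as a stand-in; it is not the
NameSearch® routine, whose "JAN" encoding is not described here. The
helper names soundex and group_by_key are chosen for illustration only.

```python
from collections import defaultdict

# American Soundex digit classes; vowels, H, W and Y carry no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex key for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = code
    return (key + "000")[:4]


def group_by_key(names):
    """Group names by phonetic key, keeping the input order inside each group."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


names = ["John", "Joan", "Jim", "Jane", "Jimmy", "Jenn", "Smith"]
groups = group_by_key(names)
for key, members in groups.items():
    print(key, members)
```

Under Soundex all six J names share the key J500, so an inquiry on any
one of them retrieves the other five, while Smith stays in its own group.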

To combat this problem, NameSearch® uses frequency analysis routines
to determine the extent of phonetic tokenization. Generic frequency
tables are supplied for the default search services. Customized tables
can be produced by modifying the tables through the generation shell
or by running a representative sample of names through the generation
shell's frequency analysis tool.
With an empty frequency table, phonetic tokenization is always applied.
Phonetically tokenizing every word degrades performance, because
fuzzier searches retrieve larger sets of records.
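
One way to picture the role of the frequency table is as a gate in
front of the phonetic encoder. The sketch below is an assumption about
the general idea, not the NameSearch® implementation: the function names
crude_key and search_token, the "=" and "~" token prefixes, and the
representation of the table as a plain set of words are all hypothetical.

```python
from typing import Callable, Set


def crude_key(word: str) -> str:
    """Toy phonetic key: the first letter plus the later consonants,
    skipping a consonant that repeats the last letter already kept."""
    word = word.upper()
    key = word[:1]
    for ch in word[1:]:
        if ch not in "AEIOUYHW" and ch != key[-1]:
            key += ch
    return key


def search_token(word: str, frequent_words: Set[str],
                 phonetic_key: Callable[[str], str] = crude_key) -> str:
    """Exact token for words listed in the frequency table,
    phonetic token for everything else."""
    normalized = word.upper()
    if normalized in frequent_words:
        return "=" + normalized  # common word: phonetic tokenization disabled
    return "~" + phonetic_key(normalized)  # uncommon word: fuzzy token


frequent = {"JOHN", "SMITH"}  # hypothetical table contents
for w in ("John", "Joan", "Smythe"):
    print(w, search_token(w, frequent), search_token(w, set()))
```

With the hypothetical table above, John is matched exactly while Joan
and Smythe are still encoded phonetically. With an empty set, every word
falls through to the phonetic branch, which corresponds to the
always-applied case described above.
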

The extent of phonetic tokenization is determined by the degree of
confidence and the size of the database. For smaller databases the
decision is much easier. The number of records returned by an inquiry
grows linearly with the size of your database. For example, if your
database has one hundred thousand records and an inquiry returns
0.005% of them on average, you will display five records. A database
with one million rows will return fifty records, and a database with
ten million records will return five hundred candidates. The smaller
database can employ more phonetic tokenization: it is easy to justify
the extra expense when only one or two more records are being returned.
If you increase the use of phonetic tokenization on a database with
ten million entities, the cost of retrieving an additional couple of
hundred records becomes prohibitive.

If the frequency tables are extensive, phonetic tokenization will be
disabled for the majority of inquiries. Your system will return very
tight matches but will be intolerant of phonetic variations. The
degree of phonetic tolerance is determined by the application's
objectives. Fraud investigators may prefer greater flexibility, while
users of a customer information system will need a tighter criterion.
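
The linear growth in returned records can be checked with a few lines
of arithmetic. In the sketch below, the 0.005% rate from the example is
written as 5 records per 100,000. The broader rate of 7 per 100,000 is
a made-up figure, used only to show how the same increase in tolerance
costs very different amounts at different database sizes. The function
name expected_candidates is hypothetical.

```python
def expected_candidates(database_size: int, hits_per_100k: int) -> int:
    """Average number of records returned per inquiry."""
    return database_size * hits_per_100k // 100_000


BASELINE = 5  # 0.005% of the database, as in the example above
BROADER = 7   # hypothetical rate after enabling more phonetic tokenization

for size in (100_000, 1_000_000, 10_000_000):
    base = expected_candidates(size, BASELINE)
    broad = expected_candidates(size, BROADER)
    print(f"{size:>10,} records: {base:>3} candidates, "
          f"{broad - base:>3} more with broader matching")
```

Under these assumed rates the extra cost is two records for the
smallest database and two hundred for the largest.
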

NameSearch® offers a variety of pre-built phonetic tokenization
routines and provides the capability to create customized phonetic
encoding scripts. The standard routine is the least tolerant of
phonetic differences. The extended phonetic routine overcomes a
greater variety of phonetic errors, and other routines vary in their
degree of phonetic tolerance. The type of phonetic tokenization
employed is determined by user expectations, system requirements and
database size. As phonetic tokenization becomes more general or
tolerant, the number of records returned increases, as does the
opportunity for finding irrelevant records. The selection of a
phonetic routine is one of the options that can be set and tested
with NameSearch®'s Generation Shell.