Intelligent Search Technology, Ltd. specializes in search and matching software. Name Search, our flagship product, provides intelligence to both online and batch search and matching applications. Name Search not only enables systems to find and match information based on personal and corporate names but also comes with powerful address searching and e-mail searching services. Correct Address is address verification, validation and correction software that harnesses the intelligence of Name Search. Name Search also powers ISTwatch, terrorist-checking software that enables compliance with the USA PATRIOT Act. Merlin Merge, supplied with Name Search, is used for duplicate record identification and merge/purge operations.

Phonetic coding aggravates skewed distribution


Names and addresses suffer from a skewed distribution. A few words represent the majority of names, while large volumes of uncommon names exist but occur infrequently. This is most dramatically illustrated by the analysis of people's names in the United States. While there are 2.5 million last names and over 3.2 million first names, three hundred surnames represent thirty-five percent of the population, and over sixty-five percent of the population has one of four hundred first names. The skew in the distribution of company names and street addresses is just as extreme. Inquiries usually follow a distribution pattern similar to that of the names in the database. The problem is further complicated by variations in name frequency between geographical locations and by the type of information stored in the database.
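The skew described above can be measured on any name list. The following is a minimal sketch, not part of NameSearch®; the function name `top_k_coverage` and the sample data are assumptions made for illustration. It reports what share of all records is covered by the k most common names.

```python
from collections import Counter


def top_k_coverage(names, k):
    """Return the fraction of records covered by the k most frequent names."""
    # Normalize case and surrounding whitespace so "Smith" and "SMITH " count together.
    counts = Counter(name.strip().upper() for name in names)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(k))
    return covered / total


# Toy sample (invented for illustration): one common name dominates.
sample = ["Smith"] * 6 + ["Jones"] * 3 + ["Zabriskie"]
print(top_k_coverage(sample, 1))
print(top_k_coverage(sample, 2))
```

Running the same measurement on a representative sample of your own data shows how strongly a handful of names dominates the database.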

Phonetic tokenization increases the skew in the distribution of common names. For example, an inquiry on John also returns Joan, Jim, Jane, Jimmy, Jenn and other names that fall into the "JAN" phonetic pattern. Aggravating the skew in this way sacrifices both quality and performance.
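The collapsing effect can be reproduced with any phonetic code. The sketch below uses classic American Soundex as a stand-in; it is not the NameSearch® tokenization routine, and its code ("J500") differs from the "JAN" pattern named above. It is shown only to illustrate how distinct names fall into a single phonetic bucket.

```python
# Classic Soundex consonant classes.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return (first + "".join(digits) + "000")[:4]


names = ["John", "Joan", "Jim", "Jane", "Jimmy", "Jenn"]
print({name: soundex(name) for name in names})
```

All six names share one code, so an inquiry on any of them retrieves the records for all of them.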

To combat this problem, NameSearch® uses frequency analysis routines to determine the extent of phonetic tokenization. Generic frequency tables are supplied for the default search services. Customized tables can be produced by modifying the tables through the generation shell or by running a representative sample of names through the generation shell's frequency analysis tool.

With an empty frequency table, phonetic tokenization is always applied. Phonetically tokenizing every word degrades performance, because fuzzier searches retrieve larger result sets.
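One way to picture the role of the frequency table is as a gate in front of the phonetic encoder. The sketch below is an assumption about the general idea, not the NameSearch® implementation: the `tokenize` function, the `threshold` parameter and the `crude_phonetic_key` placeholder encoder are all hypothetical. Words that the table marks as common are matched exactly, and everything else is phonetically tokenized.

```python
def crude_phonetic_key(word):
    """Placeholder encoder: keep the first letter, drop later vowels, H, W, Y, collapse repeats."""
    w = word.upper()
    key = w[:1]
    for ch in w[1:]:
        if ch in "AEIOUHWY":
            continue
        if ch != key[-1]:
            key += ch
    return key


def tokenize(word, frequency_table, threshold, encode=crude_phonetic_key):
    """Return ("exact", word) for common words, ("phonetic", key) otherwise."""
    count = frequency_table.get(word.upper(), 0)
    if count >= threshold:
        # Common word: a fuzzy match would pull in too many records.
        return ("exact", word.upper())
    return ("phonetic", encode(word))


# Hypothetical frequency table with invented counts.
table = {"JOHN": 50000, "SMITH": 90000}
print(tokenize("John", table, threshold=10000))  # common name, matched exactly
print(tokenize("Jon", table, threshold=10000))   # rare spelling, matched phonetically
print(tokenize("John", {}, threshold=10000))     # empty table, always phonetic
```

With an empty table every lookup misses, so every word takes the phonetic branch, which matches the behavior described above.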

The extent of phonetic tokenization is determined by the degree of confidence required and the size of the database. For smaller databases the decision is much easier. The number of records returned by an inquiry grows linearly with the size of your database. For example, if your database has one hundred thousand records and an average inquiry returns 0.005% of them, you will display five records. A database with one million records will return fifty candidates, and a database with ten million records will return five hundred. The smaller database can employ more phonetic tokenization, because the extra expense is easy to justify when only one or two more records are returned. If you increase the use of phonetic tokenization on a database with ten million entities, the cost of retrieving an additional couple of hundred records becomes prohibitive.

If the frequency tables are extensive, phonetic tokenization will be disabled for the majority of inquiries. Your system will return very tight matches but will be intolerant of phonetic variations. The degree of phonetic tolerance is determined by the application's objectives. Fraud investigators may prefer greater flexibility, while users of a customer information system will need tighter criteria.
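The linear growth in the example above is simple to check. The sketch below reproduces the arithmetic for the three database sizes mentioned; the function name `expected_candidates` is an assumption chosen for illustration, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def expected_candidates(database_size, hit_rate_percent="0.005"):
    """Return the average number of records an inquiry returns.

    hit_rate_percent is the share of the database returned per inquiry,
    expressed as a percentage and passed as a decimal string.
    """
    rate = Fraction(hit_rate_percent) / 100
    return round(database_size * rate)


for size in (100_000, 1_000_000, 10_000_000):
    print(size, expected_candidates(size))
```

Because the return rate is fixed, any increase in tolerance that raises the rate is multiplied by the full size of the database.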

NameSearch® offers a variety of pre-built phonetic tokenization routines and provides the capability to create customized phonetic encoding scripts.

The standard routine is the least tolerant of phonetic differences. The extended phonetic routine overcomes a greater variety of phonetic errors, and other routines vary in their degree of phonetic tolerance. The type of phonetic tokenization employed is determined by user expectations, system requirements and database size. As phonetic tokenization becomes more general or tolerant, the number of records returned will increase, as will the opportunity for finding irrelevant records. The selection of a phonetic routine is one of the options that can be set and tested with the NameSearch® Generation Shell.
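The trade-off between tolerance and result size can be illustrated with two toy encoders. These are hypothetical and do not correspond to the standard or extended NameSearch® routines; the `skeleton` and `candidates` functions, the `MERGE` table and the name list are assumptions made for this sketch. The stricter key keeps every consonant distinct, while the more tolerant key also merges a few similar-sounding consonants.

```python
# Hypothetical merge table for the tolerant key: treat M as N, K as C, Z as S, D as T.
MERGE = str.maketrans({"M": "N", "K": "C", "Z": "S", "D": "T"})


def skeleton(word, merge=None):
    """Keep the first letter, drop later vowels, H, W, Y, and collapse repeated letters."""
    w = word.upper()
    if merge:
        w = w.translate(merge)
    key = w[:1]
    for ch in w[1:]:
        if ch in "AEIOUHWY":
            continue
        if ch != key[-1]:
            key += ch
    return key


def candidates(query, names, key_fn):
    """Return every name whose key equals the key of the query."""
    target = key_fn(query)
    return [name for name in names if key_fn(name) == target]


names = ["John", "Jon", "Jim", "Jane", "Joan", "Jimmy", "Jack", "Jason"]
strict = candidates("John", names, skeleton)
tolerant = candidates("John", names, lambda w: skeleton(w, MERGE))
print(len(strict), strict)
print(len(tolerant), tolerant)
```

In this toy list the tolerant key also pulls in Jim and Jimmy, which may be welcome for a fraud investigation and unwelcome in a customer information system.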


 
 


To find out more, call (800) 287-0412 or (845) 278-8989
Copyright © 1993-2003 Intelligent Search Technology Ltd.