The DataCleanser is based upon patented techniques to solve this problem fast, with the particular desire to improve the accuracy of the match process for very large datasets.
For example, the equality of two records may not be specified easily as a simple arithmetic predicate, but rather by a set of rules. The user-specified rules determine when two distinct records (possibly from two different databases) provide information about the same entity. The DataCleanser provides a high-level declarative rule-based knowledge base that includes some handy primitive match functions that can be used directly in rules. This is a very important benefit to organizations that work under strict time constraints and have precious little time to experiment with alternative matching criteria specified in low level languages.
The DataCleanser works by first sorting the entire dataset, by user-defined keys extracted from the data, to bring the potentially matching records close together in a list. Rigorous studies have shown that no single pass over the data using one particular scheme as a key for sorting performs as well as combining the results of several independent runs each using a different key for ordering data. It has been demonstrated that multiple passes followed by the computation of the ``transitive closure'' consistently dominates in accuracy.
The moral is simply that several distinct ``cheap'' passes over the data produces more accurate results than one ``expensive'' pass over the data.
The DataCleanser's Graphical User Interface provides a convenient environment to design the best key for a particular application. Likewise, the GUI assists the user in defining rules.
Suppose two names are spelled nearly (but not) identically, and have the exact same address. We might infer they are the same person. On the other hand, suppose two records have exactly the same social security numbers, but the names and addresses are completely different. We could either assume the records represent the same person who changed his name and moved, or the records represent different persons, and the social security number field is incorrect for one of them. Without any further information, we may perhaps assume the later. The more information there is in the records, the better inferences can be made. For example, Michael Smith and Michele Smith could have the same address, and their names are ``reasonably close''. If gender and age information is available in some field of the data, we could perhaps infer that Michael and Michele are either married or siblings.
Here is a simplified rule in English that represents this kind of rule based reasoning:
Given two records, r1 and r2.
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN
r1 is equivalent to r2.
The implementation of differ slightly specified in this rule in English is implemented by a distance function applied to the first name fields of two records, and the comparison of its results to a threshold to capture obvious typographical errors that may occur in the data.
The DataCleanser provides a number of useful built in distance functions including string edit distance, phonetic distance and typewriter distance! Users can easily write their own.
It is important to note that rules can be written for a wide range of applications dealing with arbitrary data types. We chose to use string data in these examples (e.g., names, addresses) for pedagogical reasons (after all everyone gets ``faulty'' junk mail). We could equally as well demonstrate the concepts using alternative databases of different typed objects and correspondingly different rule sets. For example, the DataCleanser can be extended easily to clean image databases, where equality is defined with respect to image features, i.e. two images are equivalent if they both contain a common image feature, like color.
The following table displays records with a variety of errors that may commonly be found in mailing lists for direct marketing, for example. Indeed, poor implementations of DataCleansing tasks by some commercial organizations typically lead to several pieces of the same junk mail being mailed (at obviously far greater expense) to the same household (as nearly everyone has experienced). The adjacent pair of (same colored) records in this table are identified by the DataCleanser's rule base as equivalent!
| ID | Name (First, Initial, Last) | Address |
| 334600443 | Lisa Boardman | 144 Wars St |
| 334600443 | Lisa Brown | 144 Ward St. |
| 525520001 | Ramon Bonilla | 38 Ward St. |
| 525250001 | Raymond Bonilla | 38 Ward St. |
| 0 | Diana D. Ambrosion | 40 Brik Church Av. |
| 0 | Diana A. Dambrosion | 40 Brick Church Av. |
| 0 | Colette Johnen | 600 113th St. apt. 5a5 |
| 0 | John Colette | 600 113th St. ap. 585 |
| 850982319 | Ivette A Keegan | 23 Florida Av. |
| 950982319 | Yvette A Kegan | 23 Florida St. |