DataCleansing in Action!

The naive way of cleaning data is to compute a Duplicate Elimination Sort, or run a full comparison of each record to every other. These strategies, however, assume a total ordering of the data exists and the data is not corrupted. In many real world datasets we cannot assume there exists a total ordering, nor a simple duplicate-matching process, meaning even slight errors in the data imply all possible ``matches'' of common data about the same entity may not be found.

The DataCleanser is based upon patented techniques to solve this problem fast, with the particular desire to improve the accuracy of the match process for very large datasets.

For example, the equality of two records may not be specified easily as a simple arithmetic predicate, but rather by a set of rules. The user-specified rules determine when two distinct records (possibly from two different databases) provide information about the same entity. The DataCleanser provides a high-level declarative rule-based knowledge base that includes some handy primitive match functions that can be used directly in rules. This is a very important benefit to organizations that work under strict time constraints and have precious little time to experiment with alternative matching criteria specified in low level languages.

The DataCleanser works by first sorting the entire dataset, by user-defined keys extracted from the data, to bring the potentially matching records close together in a list. Rigorous studies have shown that no single pass over the data using one particular scheme as a key for sorting performs as well as combining the results of several independent runs each using a different key for ordering data. It has been demonstrated that multiple passes followed by the computation of the ``transitive closure'' consistently dominates in accuracy.

The moral is simply that several distinct ``cheap'' passes over the data produces more accurate results than one ``expensive'' pass over the data.

The Basic DataCleansing Method

Given a collection of two or more databases, we first concatenate them into one sequential list of records and then apply the sorted-neighborhood method. The sorted-neighborhood method for DataCleansing can be summarized in three phases:
Create Keys
Compute a key for each record in the list by extracting relevant fields or portions of fields.
Sort Data
Sort the records in the data list using the key
Merge
Move a fixed-sized window through the sequential list of records limiting the comparisons for matching records to those records in the window.

The DataCleanser's Graphical User Interface provides a convenient environment to design the best key for a particular application. Likewise, the GUI assists the user in defining rules.


Let's look at an example of the kind of rules the user may have in mind.

Suppose two names are spelled nearly (but not) identically, and have the exact same address. We might infer they are the same person. On the other hand, suppose two records have exactly the same social security numbers, but the names and addresses are completely different. We could either assume the records represent the same person who changed his name and moved, or the records represent different persons, and the social security number field is incorrect for one of them. Without any further information, we may perhaps assume the later. The more information there is in the records, the better inferences can be made. For example, Michael Smith and Michele Smith could have the same address, and their names are ``reasonably close''. If gender and age information is available in some field of the data, we could perhaps infer that Michael and Michele are either married or siblings.

Here is a simplified rule in English that represents this kind of rule based reasoning:

Given two records, r1 and r2.
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN
r1 is equivalent to r2.

The implementation of differ slightly specified in this rule in English is implemented by a distance function applied to the first name fields of two records, and the comparison of its results to a threshold to capture obvious typographical errors that may occur in the data.

The DataCleanser provides a number of useful built in distance functions including string edit distance, phonetic distance and typewriter distance! Users can easily write their own.

It is important to note that rules can be written for a wide range of applications dealing with arbitrary data types. We chose to use string data in these examples (e.g., names, addresses) for pedagogical reasons (after all everyone gets ``faulty'' junk mail). We could equally as well demonstrate the concepts using alternative databases of different typed objects and correspondingly different rule sets. For example, the DataCleanser can be extended easily to clean image databases, where equality is defined with respect to image features, i.e. two images are equivalent if they both contain a common image feature, like color.

The following table displays records with a variety of errors that may commonly be found in mailing lists for direct marketing, for example. Indeed, poor implementations of DataCleansing tasks by some commercial organizations typically lead to several pieces of the same junk mail being mailed (at obviously far greater expense) to the same household (as nearly everyone has experienced). The adjacent pair of (same colored) records in this table are identified by the DataCleanser's rule base as equivalent!

IDName (First, Initial, Last) Address
334600443 Lisa Boardman 144 Wars St
334600443 Lisa Brown 144 Ward St.
525520001 Ramon Bonilla 38 Ward St.
525250001 Raymond Bonilla 38 Ward St.
0 Diana D. Ambrosion 40 Brik Church Av.
0 Diana A. Dambrosion 40 Brick Church Av.
0 Colette Johnen 600 113th St. apt. 5a5
0 John Colette 600 113th St. ap. 585
850982319 Ivette A Keegan 23 Florida Av.
950982319 Yvette A Kegan 23 Florida St.
Example matching records detected by the DataCleanser.

Electronic Digital Documents, Inc.
edd@npsa.com