This document briefly describes the architecture of a DataCleansing DataBlade Module. The DataBlade Module includes a client-side GUI to guide the user through the process of tailoring their DataCleansing process and a set of server-side tables and functions to implement the user-defined Cleansing procedure.
| Figure 1 : A general-purpose Merge/Purge architecture |
The above figure displays the overall architecture of the general-purpose DataCleanser. The middle Merge/Purge box represents the routines to implement the patented "multi-pass" approach of the sorted-neighborhood method including the code for the sort, window-scan, and transitive-closure phases. The DataCleanser inputs one or more table(s) of records selected by the user (possibly from a multi-database system). It is assumed that the possible relevant records as well as any necessary schema integration are performed by the user prior to the use of the DataCleanser.
In this section we describe our design for the DataCleanser DataBlade Module. We start with the client-side GUI and provide an idea of the information we need to save at the server-side to later execute the Cleansing application. We then move to the server-side and provide an example of the interface we plan to provide to access the DataCleansing functions.
A client-side Graphical User Interface will guide a DataCleanser user through the process of preparing a Cleansing application. The information entered with the GUI will be stored on server-side tables for later use by server-side DataBlade functions.
A prototype implementation of this GUI was implemented with Tcl/Tk and demonstrates the basic process to set up a DataCleansing application. Figure 2 shows a screen dump of the prototype GUI. Our final client GUI shipped with the product will have facilities to
![]() |
| Figure 2 : Prototype GUI (Reduced. Click to enlarge) |
Most of the data entered with the GUI will be stored in tables that can be later accessed by the server to run a particular DataCleansing application. This allows the user to define a multitude of DataCleansing applications, each tailored to a different task. Here is a sketch of the schema for the database the DataCleanser's GUI will create to store its information:
(There will be a row in this table for each Data Cleansing application
defined with the GUI. name provides a unique identifier for each
DataCleansing application.)
(This table stores the information of each independent sorted-neighborhood
pass. The foreign key mergeInfo links each entry in this table
to the proper MergeInformation entry.)
(This table contains the information to perform the appropriate
reorganization of the data during the sort phase.)
(This table holds the rule-based equational theory. Rules will be
stored here and installed in the server when a Data Cleansing
application is run.)
Suppose we want to create a DataCleansing application that will remove duplicates from a mailing list. Assume the mailing list is stored in one table called Person whose data schema is the following:
After testing several possibilities, we decide two (2) independent passes will be necessary in our DataCleansing application. On the first pass, we will use the SSN field followed by the FirstName field followed by the LastName field as the sorting key. For the second pass, we will use Zipcode followed by LastName and City. We will use the same rules to find similar records and use a window of 10 records during the window scan phase. We call this application the 'Clean Person' Cleansing application.
When this information is sent from the GUI to the server, a new row will be added to MergeInformation containing the following information:
(I.e., a new Cleansing application named 'Clean Persons' using 2 passes.) Two rows describing each pass (the sorting key used for each pass, the window size, and the rules) are then inserted into RunInformation.
The KeyDescription table will store the sorting keys as follows:
Initially, the basic functionality of the DataCleanser DataBlade Module will be encapsulated into two server-side procedures. The first one, EDDMerge will be a query-level callable function that takes a Cleansing application name and a set of rows over which to operate and returns a set of rows with a special field indicating which rows are similar. The second procedure is an aggregate function, EDDPurge that calls a client-side function to perform the actual purge of records. As we implement this initial interface, we are sure to encounter the need for more functions and procedures or several flavors of the same function (see the Summary section for a discussion of these issues).
The EDDMerge() function takes two arguments. The first argument must be the name of an already created DataCleansing application (i.e., the name must appear in the MergeInformation table). The second argument is a set of rows over which to apply the DataCleansing procedure. The return value of EDDMerge() is another set of rows whose data schema includes all fields of the input set of rows plus an additional field named EDDMergeResult. If the input set of rows belong to the Person table from the example in the previous section,
then the output data schema will be the following:
This additional integer field will contain a number that will map each row into clusters or groups of similar rows as determined by the rule-program. In other words, all rows with identical EDDMergeResult values belong to the same "logical group" and were determined to be similar.
To use EDDMerge() from an SQL-query simply include the function call in the FROM clause as follows:
Of course, we are usually interested in using EDDMergeResult to find the resulting clusters of similar rows. To accomplish this, simply GROUP BY the value of EDDMergeResult as follows:
Notice that in the above query we could use ORDER instead of GROUP to obtain the same information. The reason for using GROUP will become apparent in the next section.
Although a DataCleansing application can end after the results of EDDMerge() are obtained, many applications require the further processing of the information in the computed clusters. For this purpose, the DataCleanser DataBlade Module will have an EDDPurge() aggregate function which will pass the resulting clusters of rows to a user-defined function. This user-defined function will then be in charge of deciding what to do with the similar rows. A simple procedure is to delete all but one of the passed rows effectively removing duplicates from the input data set. Alternatively, one may choose to concatenate various fields from the individual records to accumulate more information about an individual that was previously distributed amongst different datasets.
Following the mailing list example used throughout this document, a final Purge phase could be called from an SQL-query as follows:
We have described the basic capabilities and design of the DataCleanser DataBlade Module. We are presently developing a client-side user interface to allow the user to enter all the parameters of a DataCleansing application in a convenient and easy-to-use fashion. Then, a simple interface of functions and aggregates will allow the use of the DataCleansing facilities directly from an SQL-query. We limited the discussion of possible server-side functions to two procedures: one to take care of merging data and another to take care of purging data. We expect the DataCleanser DataBlade to contain more functions than these two, most of them providing the same functionality but with a different interface. For instance, in the previous section we alluded to the removal of all duplicates as a possible task for EDDPurge(). However, an aggregate function is not necessarily the most efficient way to implement this. In many cases all you will need to do is read the first row from a group and ignore the rest. However, Illustra's aggregates does not allow an early interruption of its aggregate cycle meaning all rows in a group are visited (no matter if the wanted row was found early in the processing cycle). Thus, a better solution for such cases could be an EDDPurge() function that could be used in the same fashion as EDDMerge. For example,
Alternatively the entire DataCleansing procedure may be encapsulated in one function that can be called directly from the client-side monitor as follows: