REAL-WORLD DATA IS DIRTY!

Linking and merging common information in a database is a vexing problem for large commercial and government organizations. Large repositories of data always have numerous duplicate records about the same entities, and these records are difficult to pull together. The patented DataCleanser technology provides a convenient way to identify equivalent items through an easy-to-use, intelligent matching process.

The fundamental problem is that data from different sources frequently differ in form or are simply erroneous, whether because of typographical or transcription errors or because of deliberate fraud (aliases, in the case of names).


The process of identifying common information and cleaning a dataset is called DataCleansing.
Different industries know this problem by a variety of names including Merge/Purge, De-Dupe and Record Linking, to name a few of the more familiar ones.

The DataCleanser is a new DataBlade that performs this task efficiently over a single large database as well as over a collection of heterogeneous databases gathered from different sources. For example, in a direct-marketing application, several lists of potential customers' names gathered from credit bureaus, magazine subscription lists, and other sources can be easily and efficiently merged into one set of names, each of which uniquely identifies an individual customer.
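
To make the merge/purge idea concrete, here is a minimal sketch in Python. It is not the DataCleanser's own method, which this announcement does not describe. The function names, the sample lists, and the choice of a simple normalized-name key are all assumptions made for illustration.

```python
import re

def normalize(name):
    """Hypothetical key: lowercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^a-z ]", "", name.lower())
    return " ".join(cleaned.split())

def merge_lists(*lists):
    """Merge several name lists, keeping the first spelling seen per key."""
    merged = {}
    for source in lists:
        for name in source:
            merged.setdefault(normalize(name), name)
    return list(merged.values())

# Illustrative data only; not taken from any real source.
credit_bureau = ["John Q. Smith", "Mary Jones"]
subscriptions = ["john q smith", "MARY JONES", "Ann Lee"]

customers = merge_lists(credit_bureau, subscriptions)
print(customers)
```

An exact-key merge like this only catches differences in case, punctuation, and spacing. Misspellings and aliases need approximate matching, which is where equivalence rules come in.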

The DataCleanser has been rigorously evaluated against real-world data supplied by a Child Welfare Department. The DataCleanser has been shown to be accurate and effective when processing data with a variety of errors and duplicate information.

The DataCleanser provides a RULE PROGRAMMING MODULE that is easy to program and effective at finding duplicates, especially in an environment with errors that are specific to a specialized domain. A powerful Graphical User Interface assists users in specifying the rules that determine the equivalence and linkage between different records in the database.
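
As a rough illustration of what an equivalence rule might look like, the Python sketch below declares two records equivalent when certain fields agree exactly or approximately. The rule syntax of the DataCleanser itself is not shown in this announcement, so the record fields (first, last, id, address), the similarity threshold, and both rules are hypothetical.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.75):
    """Approximate string match; the threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def same_person(r1, r2):
    """Hypothetical equivalence rules over two records (plain dicts)."""
    # Rule 1: the same non-empty ID number and a similar last name.
    if r1["id"] and r1["id"] == r2["id"] and similar(r1["last"], r2["last"]):
        return True
    # Rule 2: similar first and last names at the same address.
    if (similar(r1["first"], r2["first"])
            and similar(r1["last"], r2["last"])
            and r1["address"] == r2["address"]):
        return True
    return False

# Illustrative records only.
a = {"first": "Jonathan", "last": "Smith", "id": "", "address": "12 Elm St"}
b = {"first": "Jonathon", "last": "Smyth", "id": "", "address": "12 Elm St"}
c = {"first": "Mary", "last": "Jones", "id": "", "address": "12 Elm St"}

print(same_person(a, b), same_person(a, c))
```

Rules of this kind can be tuned to a domain: a welfare caseload and a mailing list would likely weigh names, addresses, and ID numbers differently.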

The DataCleanser DataBlade's patented architecture scales from small datasets up to massive Warehouses of Information.


Electronic Digital Documents, Inc.
edd@npsa.com