The fundamental problem is that data from different sources frequently either differ in form or are simply erroneous, whether through typographical or transcription errors or through deliberate fraud (aliases, in the case of names).
The DataCleanser is a new DataBlade that accomplishes the task efficiently, both over a single large database and over a collection of heterogeneous databases gathered from different sources. For example, in a direct-marketing application, several lists of potential customers' names gathered from credit bureaus, magazine subscription lists, and other sources can be merged easily and efficiently into one set of names, each of which uniquely identifies an individual customer.
The DataCleanser has been rigorously evaluated against real-world data supplied by a Child Welfare Department, and has been shown to be accurate and effective on data containing a variety of errors and duplicate information.
The DataCleanser provides a RULE PROGRAMMING MODULE that is easy to program and effective at finding duplicates, especially where the errors are specific to a specialized domain. A powerful Graphical User Interface assists users in specifying the rules that determine equivalence and linkage between different records in the database.
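As an illustration only, the kind of equivalence rule described above might be sketched as follows in Python. This is not the DataCleanser rule language or its API: the Record fields, the rule names, and the 0.75 similarity threshold are all hypothetical, chosen simply to show how domain-specific rules can link records that differ through typographical errors.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    # Hypothetical fields; a real schema would be domain-specific.
    first: str
    last: str
    ssn: str
    address: str


def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """True when two strings are close enough to be typographical variants.

    The threshold is an assumed value, not one taken from the product.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Each rule is a predicate over a pair of records (illustrative examples).
def same_ssn_similar_last(r1: Record, r2: Record) -> bool:
    # A shared, non-empty identifier plus a near-matching surname.
    return r1.ssn != "" and r1.ssn == r2.ssn and similar(r1.last, r2.last)


def same_address_similar_name(r1: Record, r2: Record) -> bool:
    # The same address (ignoring case) plus near-matching first and last names.
    return (
        r1.address.lower() == r2.address.lower()
        and similar(r1.first, r2.first)
        and similar(r1.last, r2.last)
    )


RULES = [same_ssn_similar_last, same_address_similar_name]


def equivalent(r1: Record, r2: Record) -> bool:
    """Two records are linked when at least one rule fires."""
    return any(rule(r1, r2) for rule in RULES)
```

With made-up records such as Record("Michael", "Smith", "111-22-3333", "12 Main St") and Record("Micheal", "Smyth", "111-22-3333", "45 Oak Ave"), the first rule links the pair despite the misspellings and the differing address. Adding a rule for a new domain amounts to appending another predicate to RULES.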
The DataCleanser DataBlade's patented architecture scales from small datasets up to massive Warehouses of Information.