Optimal training data selection for rule-based data cleansing models

Snigdha Chaturvedi; Tanveer A. Faruquie; L. Venkata Subramaniam; K. Hima Prasad; Girish Venkatachaliah; Sriram Padmanabhan

doi:10.1109/SRII.2011.25

SRII 2011

Conference paper

26 Aug 2011

Abstract

Enterprises today accumulate huge quantities of data that are often noisy and unstructured, making data cleansing an important task. Data cleansing refers to standardizing data from different sources to a common format so that the data can be better utilized. Most enterprise data cleansing models are rule based and involve a lot of manual effort. Writing data quality rules is a tedious task and often results in erroneous rules because of the ambiguities that the data presents. A robust data cleansing model should be capable of handling a wide variety of records, and this capability often depends on the choice of sample records that the knowledge engineer uses to write the rules. In this paper we present a method to select a diverse set of data records which, when used to create the rule-based data cleansing model, can cover the maximum number of records. We also present a similarity metric between two records which helps in choosing the diverse set of data samples. We also present a crowdsourcing-based labeling mechanism to label the diverse records selected by the system, so that the collective intelligence of the crowd can be used to eliminate the errors that occur in labeling sample data. We also present a method to select a difficult set of diverse examples so that the crowd and the rule writer's services can be effectively utilized to create a better cleansing model. We also present a method for selecting such records for updating an existing rule set. We present experimental results to show the effectiveness of the proposed methods. Results demonstrate an increase of 12% in the number of rules written using this procedure. We also show that the method identifies records on which the existing model yields lower accuracy than on the records identified by other techniques, and thus identifies records that are more difficult for the existing model to cleanse. © 2011 IEEE.
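
The abstract does not specify the similarity metric or the selection procedure, so the following is only a minimal sketch of the general idea of picking a diverse set of records using a pairwise similarity. It assumes a token-set Jaccard similarity and a greedy farthest-first selection; both are stand-ins chosen for illustration, not the paper's method, and the function names `jaccard_similarity` and `select_diverse` are hypothetical.

```python
from typing import List


def jaccard_similarity(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)


def select_diverse(records: List[str], k: int) -> List[str]:
    """Greedy farthest-first selection (assumed, not from the paper).

    Repeatedly picks the record whose highest similarity to any
    already-selected record is lowest.
    """
    if k <= 0 or not records:
        return []
    selected = [0]  # seed with the first record (arbitrary choice)
    # max_sim[i]: highest similarity between record i and any selected record
    max_sim = [jaccard_similarity(records[0], r) for r in records]
    while len(selected) < min(k, len(records)):
        candidates = [i for i in range(len(records)) if i not in selected]
        nxt = min(candidates, key=lambda i: max_sim[i])
        selected.append(nxt)
        for i, r in enumerate(records):
            max_sim[i] = max(max_sim[i], jaccard_similarity(records[nxt], r))
    return [records[i] for i in selected]


if __name__ == "__main__":
    sample = [
        "12 main street new york",
        "12 main st new york",
        "po box 44 austin texas",
        "flat 3 b block mg road bangalore",
    ]
    # The near-duplicate second address should be passed over.
    print(select_diverse(sample, 3))
```

In this sketch, the near-duplicate address is skipped in favor of records with different structure, which is the kind of variety a rule writer would want in a training sample.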
